FIRST: Full Information Retrieval System Thesaurus Methodology
Juan Chamero (jach [at] aunmas [dot] com), Intelligent Agents Internet Corp, Miami, USA, August 2001
Abstract
FIRST, Full Information Retrieval System Thesaurus, is a methodology for creating evolutionary HKMs, Maps of the Human Knowledge hosted on the Web. FIRST points toward an acceptable "kernel" of the HK, estimated at nearly 500,000 basic documents selected from an exponentially growing universe that doubles in size every year and currently holds nearly 1,400 million sites. Many laudable and enormous scientific efforts have been made toward building an accurate taxonomy of the Web and defining that kernel precisely. At the moment, the only tools we as users have to locate knowledge on the Web are the search engines and directories, which deliver answer lists ranging from hundreds to millions of documents, with the supposed "authorities" hidden in a rather chaotic distribution within those lists. That means exhausting search processes, with thousands of "clicks", in order to locate something valuable, let's say an authority.
FIRST creates evolutionary search engines that deliver reasonably good answers with only one click from the beginning. We use reasonability as a synonym of mediocrity because the first kernel is only a mediocre solution, henceforth to be optimized via its interactions with users. FIRST could also be considered an Expert System able to learn mainly from the mismatches in those interactions. So initially FIRST-generated kernels could be considered mediocre one-click solutions, for a given culture and for a given language, but able to learn, converging to a consensual kernel. To accomplish that, the only thing FIRST kernels need is interaction with users. The better the users represent the whole, the more the kernel will tend to represent the knowledge of that whole. For that reason, we imagine a network of HKMs implemented via our FIRST or some other equivalent evolutionary tools. As each node of this semantic network will serve a given population (or market), we could easily implement something like a DIAN, Distributed Intelligent Agents Network, to coordinate the efforts made by each local staff of Intelligent Agents (coopbots). Each node will have a kernel in a different stage of evolution, depending on its age, measured in interactivity, and on its population profile.
The main differentiation of FIRST from most present knowledge classification and representation projects rests on the hybrid procedure for building the mediocre starting solution: a staff of human experts aided by Intelligent Agents and IR algorithms. The reason for this approach is the actual "state of the art" of Artificial Intelligence, AI. The best current robots are unable to accurately detect general authorities and are easily deceived, unfortunately, by millions of document owners who, either unethically or out of ignorance, try to present their sites as authorities. Another flaw is the primitiveness of even the most advanced robots, unable to edit comprehensible syntheses of sites. The human being, on the other hand, is extremely good at those tasks, by far more accurate and more efficient.
The map itself consists of i-URLs, Intelligent URLs: brief documents, from half a page to two pages, describing the referenced sites like pieces of tutorials, classified along a set of taxonomy variables and tagged with a set of Intelligent Tags, some of them used to manage and track their evolutionary process. For each Major Subject of the HK, a Tutorial, a Thesaurus, a Semantic Network and a Logical Tree are provided and bound to the virtual evolutionary process of the users playing a sort of "knowledge game" against the kernel.
FIRST is presented here within the context of the IR-AI "state of the art". The methodology has been tested to build an HKM in 120 days. Time is a very important engineering factor due to the explosive expansion of the Web and because of its inherent high volatility. The task performed by the staff of human experts is similar to providing a Knowledge Expert System with the basic knowledge to "play" a Game of Knowledge reasonably well against average Web users. It resembles the beginnings of Deep Blue, which beat Kasparov: initially it should have been able to beat not a master but at least a second-category chess player (with a reasonably good ELO standard), and from there follow the evolutionary path through the higher levels: first category, master, international master, grand master, world championship.
Content Index
1- The Future of Cyberspace
    The solution
2- About a New Approach to Internet Communications
    Jargons Evolution
3- FIRST, Full Information Retrieval System Thesaurus
    The power of the right statistics
4- i-URL's and Intelligent Databases
    Advantages of our i-Virtual Libraries
    Notes
5- Evolutionary Process - Some Program Analysis Considerations
6- Noosphere Mechanics – Evolutionary Sequence
7- An Approach to Website Taxonomy
8- FIRST within the vast world of AI – IR
    FIRST niche
    New and Old ideas in action now
    Clustering: Vivisimo and Teoma
1- The Future of Cyberspace
The Web space and the Noosphere[1]
You could find 30,136 pages dealing with "noosphere" in Altavista at 2:22 PM Eastern Time on Thursday, April 12th, 2001. This is a rather strange word for many people, one that does not yet deserve an entry in the Merriam-Webster online dictionary. However, we know, use and enjoy the Cyberspace, a concept that at nearly the same time deserved as many as 777,290 entries in the same Altavista and which, on the contrary, has had an entry in Merriam-Webster since 1986, with the following meaning: the on-line world of computer networks. Web space is another neologism not yet included in that dictionary, but it deserves 485,805 entries in Altavista.
The Web space grows at a fantastic pace, holding today nearly one and a half billion documents, ranging from Virtual Libraries and virtual reference e-books dealing with the Major Subjects of the human knowledge to ephemeral news and trivial virtual flyers generated "on the fly" at any moment, continuously. On the Web we may find documents belonging to any of the three major Internet resources or categories: Information, Knowledge and Entertainment.
id="_x0000_t75" coordsize="21600,21600" o:spt="75" o:preferrelative="t"
path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f">
height:138pt'>
In the figure above the black crown represents the Web space and the green circle the users. The gray crown represents an intermediate net to be built in the near future with intelligent resumes of the Human Knowledge, pointing to the basic Web documents and e-books. One user is shown extracting a "cone" of what he/she needs in terms of information and knowledge. The intelligent resumes must be engineered to be good enough as introductory guides/tutorials with a set of essential hyperlinks inside. If the user wants more detail, he/she then goes directly to the right sources within the black region. Depending on the Major Subject dealt with, the user may go from resume to resume or jump to higher-level guides inside the gray region, going to the black region only to look for specific themes. Moreover, many users will be satisfied browsing within the gray region without even venturing into the black region.
Another user goes directly to the black region guided by classical search engines, as now. The black region will always be necessary and its size will grow fast as time passes. On the contrary, the gray region will fluctuate around a medium volume, growing at a relatively very low rhythm. Effectively, the Human Knowledge "kernel" of basic documents is almost bounded, changing its content but always around the same set of Major Subjects. The growth of the gray region is extremely low in comparison to that of the black region. Some Major Subjects die and others are born over time, but slowly.
For more Web sizing information see our Chapter 8, FIRST within the Vast World of AI-IR.
As a science fiction exercise we invite you to make some calculations resembling some of Isaac Asimov's stories and Carl Sagan's speculations. If the actual Human Knowledge is bound to, let's say, 250 Major Subjects or Disciplines, and for each of them we define a Virtual Library with 2,000 non-redundant e-books on average, we will have a volume of 500,000 e-books. Now we could design a methodology to synthesize an intelligent text resume for each e-book in no more than 2,000 characters on average, totaling 1,000 MB, i.e., 1 GB, storing one character in a single byte. That would be the volume of the gray region: not too much, really!
Let's then compare this volume to the volume of the black region and to the volume of the resources of the Human Knowledge. Once upon a time, there was a Web space with one and a half billion documents with an average volume estimated at 2.5 MB (we have documents ranging from 10 KB and less to 100 MB and more; to get that figure we supposed the following arbitrary size series, 1, 10, 100, 1,000, 10,000, 100,000 in KB, and assigned to each term the following arbitrary weights: .64, .32, .16, .08, .004, .002 respectively). Then we have a volume of nearly 3,750,000,000 MB! Within that giant space float, dispersed, the basic e-books, the resources of the Human Knowledge, with an estimated volume of nearly 500,000 MB, assigning 1 MB to each one: half a megabyte of text plus 100 images of 5 KB each, on average.
Black
Region: ~3,750,000 GB => HK ~ 500 GB => Grey Region ~ 1 GB
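As a quick check of the arithmetic above, here is a minimal back-of-the-envelope sketch in Python, reproducing the stated estimates (1,500 million documents at 2.5 MB each, 500,000 basic e-books at 1 MB each, and one 2,000-character intelligent resume per e-book); the averages are the document's own assumptions, not measured values.

```python
# Back-of-the-envelope sizing of the three regions, using the estimates
# stated above (assumed averages, not measurements).

WEB_DOCUMENTS = 1_500_000_000      # documents in the black region
AVG_DOC_MB = 2.5                   # assumed average document size, MB

HK_EBOOKS = 250 * 2_000            # 250 Major Subjects x 2,000 e-books each
AVG_EBOOK_MB = 1.0                 # ~0.5 MB of text + 100 images of 5 KB

RESUME_CHARS = 2_000               # one intelligent resume per e-book
BYTES_PER_CHAR = 1

black_region_gb = WEB_DOCUMENTS * AVG_DOC_MB / 1_000
hk_gb = HK_EBOOKS * AVG_EBOOK_MB / 1_000
gray_region_gb = HK_EBOOKS * RESUME_CHARS * BYTES_PER_CHAR / 1_000_000_000

print(f"Black region: ~{black_region_gb:,.0f} GB")   # ~3,750,000 GB
print(f"HK e-books:   ~{hk_gb:,.0f} GB")             # ~500 GB
print(f"Gray region:  ~{gray_region_gb:,.0f} GB")    # ~1 GB
```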
This incredible result demonstrates how easy it will be to compile a rather stable HKIS, Human Knowledge Intelligent Summary, in relation to the unstable, noisy, bubbling, fizzy and always growing black region. Once the effort is done, the upgrade will be facilitated via Expert Systems and a set of specialized Intelligent Agents that will detect and extract from the black region only the "necessary" changes.
In the figure above we depict the actual Web space in black, resembling the physical space of the Universe. No doubt the information we need as users is up there, but where? That virtual space is really almost black for us. Some members of the Cyberspace that provide searching services, titled Search Engines and/or World Wide Web Directories, are like stars that irradiate light all over the space to make sites indirectly visible. Sometimes we may find quite a few sites with their own light, like stars, activated by publicity in conventional media, but the rest is only illuminated by those services at users' request. Let's look a little deeper at the nature of this singular searching process.
For each resource (body) located in the Web space at a URL, which stands for Uniform Resource Locator, the robots of those lighting services prepare a brief summary with some information extracted from it, no more than a paragraph, and then all the collected information goes to their databases. The summaries have attached to them some keywords extracted from the visited resources and are consequently indexed under as many keywords as they have attached.
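A minimal sketch in Python of the kind of process just described: a robot extracts a short summary and a handful of keywords from each URL and files them in a keyword index. The function names and the naive word-frequency heuristic are illustrative assumptions, not the actual code of any search engine.

```python
from collections import defaultdict

def robot_visit(url: str, page_text: str, max_summary_chars: int = 250):
    """Crude imitation of a crawler: take the first paragraph as the summary
    and the most frequent longer words as keywords (illustrative only)."""
    summary = page_text.strip().split("\n")[0][:max_summary_chars]
    words = [w.strip(".,!?").lower() for w in page_text.split()]
    freq = defaultdict(int)
    for w in words:
        if len(w) > 3:                 # skip very short words
            freq[w] += 1
    keywords = sorted(freq, key=freq.get, reverse=True)[:5]
    return summary, keywords

# The engine's database: an inverted index from keyword to (URL, summary) pairs.
index = defaultdict(list)

def add_to_index(url, page_text):
    summary, keywords = robot_visit(url, page_text)
    for kw in keywords:
        index[kw].append((url, summary))
```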
The actual robots are very "clever" but extremely primitive compared to human beings. They do their best, and they also have to perform their work fast, in fractions of a millisecond per resource, so it would be impractical to be more sophisticated, because the time of "evaluation" grows exponentially with the level of cleverness. To facilitate the robots' work, Website programmers and developers have wise tools at hand, but many of them overuse those facilities so badly as to make them unwise. In fact, with those tools the programmers can communicate to the robots the essential information the site owners wish to be known about the site.
Those wise gateways are now noisy because most people try to deceive the robots, overselling what should be the essential information. Why do they do that? Because the Search Engines must present the sites listed hierarchically, the first the best! Something similar occurs in the Classified Section of newspapers: people wishing to be listed first unethically make nonsense use of the first letter of the alphabet, so AAAAAAA Home Services goes before, for instance, AA Home Services. The Search Engines do not have much room to design a "fair" methodology to rank the sites with equity, and Internet is a non-policed realm besides.
One trivial criterion would be to count how many times a keyword is cited within the resource, but that proved misleading, because the robots only browse the resource partially, making it practically impossible to differentiate a sound academic treatise from a student's homework on the same subject. To make things worse, programmers, developers and content experts know all those tricks and consequently overuse the keywords they believe are significant.
The Search Engines have improved a lot over the last two years, but the searching process remains highly inefficient and tends to collapse. To help site owners gain positions within the lists (in fact, to get more light), ethical and unethical techniques and programs proliferate, most of them apt to deceive the "enemy", namely the Search Engines. Even in a "bona fide" utopia it is impossible for a robot to differentiate between a complex site and a humble site dealing with the same subject. Complex site architectures can even make sites invisible to them, because robots are only well suited to evaluating flat and simple sites. Besides, search engines like Google also need to break even commercially and start selling pseudo forms of score enforcement to desperate site owners that need traffic to subsist.
We emphasize again the fact that the "light" a Search Engine provides to each URL is indirect, as the Moon reflects the Sun's light. Our conclusion is then that most of the information and the knowledge is hidden in the darkness of the Cyberspace.
The Matchmaking Realm
Now that we know the meaning of HK, the Human Knowledge, we may define HKIS, the Human Knowledge Intelligent Summaries, a set of summaries (we will soon explain why we title them intelligent), and NHKIS, a Network of Human Knowledge Intelligent Summaries, which corresponds to the gray crown of the figures above. Now we are going to enter into the problem of the languages and jargons spoken in the Black Region, in the Gray Region and mainly in the Green Region.
Internet, the Realm of Mismatch
Websites are built to match users; they are like lighthouses in the darkness, broadcasting information, knowledge and, in the case of e-Commerce, some kind of attracting information presented as "opportunities". What really happens is that at present Internet is more the Realm of Mismatch than of Matching. The lighthouse owners cannot find the users, and the users can neither find the alleged opportunities nor understand the broadcast messages. This mismatching scenario is dramatic in the case of Portals, huge lighthouses created to attract as many people as possible via general-interest "attractions".
Something similar occurs with the databases where millions of units of supposedly useful information are stored, such as catalogs, services, manufacturers, professionals, job opportunities, commercial firms, etc.: users cannot find what they need. When we talk of mismatch we mean figures well over 95%, and in some databases matching efficiencies lower than 0.1%.
In the figure above we depicted this dramatic mismatch. The yellow point is a Website, with its offer represented by the cone emerging from it, let's say the Offer expressed in its language and its particular jargon. A black point within the green circle represents a user, and the cone emerging from it his/her Demand, expressed also in his/her language and particular jargon.
Mismatch reasons
Websites and users speak and think differently
What we discovered is that both sides speak approximately the same language but surely different jargons and, more than that, they think differently! We have depicted the gray crown because the portion corresponding to its Major Subject virtually exists: that is the portion in dark gray within its cone. The Websites have the "truth" expressed in their particular jargon, and sometimes in the "official" and standard jargon. If the Website were, for instance, a "Vertical" of the Chemical Industry, its jargon would then be within the Chemical Industry standards and its menu would be expressed in technically correct terms, resembling the Index of a Manual for that particular Major Subject: Chemical Industry.
So the conclusion of research done along two years studying the causes of mismatch was that the lighthouses speak, or intend to speak, official jargons, certified by the establishment of their particular Major Subjects. They are supposed to have the truth and they think as "teachers", expressing their truth in their menus, which are in fact "logical trees". They may claim to be e-books, and they behave, think and look pretty much like physical books.
Now let's analyze how the users act, express themselves and behave. If a user visits the site to learn, the convergence of the cones is forced: the user is obliged to think in terms of the concepts of the menu, which for him/her resembles a program of study, and we have a match scenario. If the user visits the site to search for something, that is different. When one goes searching, one tends to think in terms of keywords instead, keywords that belong to one's own jargon and, at large, to one's own Thesaurus. So, either by ignorance or, on the contrary, by being an expert, the users' cones diverge substantially from the site's cone. One of the main reasons for this divergence is that the site owners ignore what their target markets need. Many of them are migrating from conventional businesses to e-Commerce approaches and extrapolate their market know-how as it is. They worked hard for decades to match their markets and to establish agreed jargons, and now they have to face unknown users coming virtually from all over the world.
Evidently the solution will be the evolution from mismatch to match in the most efficient way. To accomplish that, both the Offer and the Demand have to approach each other until both share a win-win scenario and a common jargon.
In the figure above we depict a mismatch condition where we may distinguish three zones: the red zone represents idle and/or useless Knowledge; the gray zone corresponds to the common section, with an agreed Thesaurus concordance; and the blue zone corresponds to what the users need and want but apparently does not exist within the site. So the site owners and administrators have three lines of action: a) reduce the red zones to zero, for instance by adapting and/or eliminating supposed "attractions"; b) learn as much as possible about the blue zone; and c) combine both strategies.
At this moment the dark green zones are extremely tiny, less than 5%, Internet being the Realm of Mismatch between Users' Demand and Sites' Offer. The big efforts to be made consist in minimizing costs by eliminating useless attractions and in learning from non-satisfied Users' needs. To accomplish both purposes the site owners need intelligent tools, agents that warn them about red and blue events.
What does Intelligent mean?
Let's analyze the basic process of user-Internet interaction. A user visits a site and interacts in one of three forms, sometimes concurrently: investing time, clicking on a link, or filling a form or a box with some text, for instance to make a query to a database. Site statistics are well prepared to account for clicks, telling what "paths" were browsed by each user, but they are not well suited to account for textual interactions. Of course, you may record the queries and even the answers, but that is not enough to learn from mismatching. To accomplish that, we may create programs and/or intelligent agents that account for the different uses of the components of each answer, but they then have to do a rather heavy accounting.
If we query a commercial database for tires, the answer would be a list of tire stores; to have statistics about how frequently the users ask for this specific keyword we need to account for it; to know about the "presence" of each store as a potential seller we need to account for it; and if we want to know about the popularity of each store we need to go further, accounting for it, and so forth. That accounting process involves a terrific burden, even when done on the site server's side.
An intelligent approach would be to have all the counters needed to detect document popularity and users' behavior built into the data to be queried. That is the beginning of the idea: to provide a set of counters within the data to be queried by users, one for each type of statistic. So when a piece of data is requested, a counter is activated accounting for its presence; when it is selected by a click, another counter is activated; and when the user, after reading the "intelligent summary" received, decides to click over the original site or over one of its inner hyperlinks, yet another counter is activated.
id="_x0000_i1028" type="#_x0000_t75" style='width:291pt;height:218.25pt'>
Here a typical track of user-site interaction is represented. The user makes a query for "tires". The i-Intelligent Database answers, sending all the data it has indexed under tire and adding the list of synonyms and related keywords it has for tire. Each activated i-URL accounts for its presence in that answer, adding one to the corresponding counter in the i-Tags zone. If the user clicks on a specific i-URL, the system presents it to the user, accounting for this preference in another counter of the i-Tags zone.
Finally, if the user decides to access the commented site located in the black crown, he/she makes a click and another counter is activated within the i-Tags zone. At the same time the counter corresponding to the keyword tire is incremented by one, and the same happens if the user activates some synonym or related keyword. If the answer contains zero data, it means a mismatch, either because of an error or as a warning about a resource non-existent within the database. In both cases the system has to activate different counters for the wrong or non-existing keyword in order to account for the popularity of this specific mismatch. If the popularity is high, it is a warning signal to the site Chief Editor (either human or virtual) about the potential acceptance of the keyword, either as a synonym or as a related keyword. At the same time, the system may urge a search for additional data within the black region. From time to time the system could suggest a rehearsal of the i-URL summaries database in order to assign data to the new keywords as well. We will see how to work with a network of these Expert Systems at different stages of evolution.
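A minimal sketch in Python of the counter mechanism just described: each i-URL carries its own i-Tags counters (presence in an answer, selection of its summary, click-through to the original site), and the database keeps keyword and mismatch counters. The class and field names are illustrative assumptions, not the actual FIRST implementation.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class IUrl:
    url: str
    summary: str
    keywords: set
    # i-Tags counters, updated as users interact with this record
    presence: int = 0       # appeared in an answer list
    selected: int = 0       # its summary was opened
    click_through: int = 0  # the user went on to the original site

class IDatabase:
    def __init__(self, records, synonyms=None):
        self.records = records
        self.synonyms = synonyms or {}            # e.g. {"tyres": "tires"}
        self.keyword_hits = defaultdict(int)      # popularity of known keywords
        self.mismatch_hits = defaultdict(int)     # popularity of unknown keywords

    def query(self, keyword):
        kw = self.synonyms.get(keyword, keyword)
        answer = [r for r in self.records if kw in r.keywords]
        if not answer:
            # zero-data answer: record the mismatch so the Chief Editor
            # (human or virtual) can later consider promoting the keyword
            self.mismatch_hits[keyword] += 1
            return []
        self.keyword_hits[kw] += 1
        for r in answer:
            r.presence += 1
        return answer
```

A user opening a summary would then increment that record's selected counter, and following the link to the site in the black crown would increment click_through.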
Among the intelligent features we also consider registering the IP of the users' interactions and the sequence of queries, normally related to something not found. The users' keyword strings are in turn related to specific subjects within the Major Subject of the site. So, statistically, the analysis of the keyword strings tells us about the popularity of the actual menu items and suggests new items to be considered.
Some examples of actual general search inefficiency
Let's try to search for something apparently trivial like "Internet statistics", for instance using one of the best search engines, Google: more than 1,500,000 sites! Do not dip too far into that list; only check what the first 20 or 30 sites offer. Most of the content shown by the sites of that sample is obsolete and, when it is updated, you are harassed by a myriad of sales offers for particular statistics, market research studies and the like, priced in the thousands and up. And if this scenario occurs with supposed authorities (Library of Congress, Cyberatlas, About.com statistics sites, Internet Index, Data Quest, InternetStats), what then of the 1,500,000 remaining?
What if that noisy cluster were replaced by a brief comment made by a statistician, telling the state of the art of Internet Statistics and suggesting alternative ways to compile statistics from free, updated authorities that surely exist on the Web? That is very easy and economical to do; it should take no more than one hour of that specialist's time. Of course, that would be feasible as a permanent solution only if the cost of updating that kind of report were relatively insignificant. Concerning this problem, we estimated that the global cost of updating a given HKM is of the order of 3% to 5% per annum of the cost of its creation. So the HKMs will be updated in two ways: evolutionarily, through their interaction with users, and authoritatively, by human experts' updates.
Let's see other examples, with "sex" and "games". Sex has more than 48,000,000 sites, and it is well known that the sources of sexual and pornographic content are fewer than 100. The rest are speculators, repetitions, transfers, and commuting sites of only one click per user, playing the ingenuous role of useful idiots. Something similar occurs with games, with more than 35,000,000 sites, and again the world providers of game machines, solutions and software number no more than 100!
For a given culture and a given moment we have the following regions in the Web space:
Red: a given HKM
Black Blue: HK Virtual Library
Regular Navy Blue: Ideal HK
Blue: Ideal HK plus New Research
Light Blue: Ideal HK plus NR plus Knowledge Movements
Deep Light Blue: Ideal HK plus NR plus KM plus Information
Everything works within an expanding universe of Human Intellectual Activity. It takes a lot of time and effort for new ideas and concepts to become part of the Ideal HK. We humans have two kinds of memory, semantic and episodic, and any culture at a given moment has its semantic memory, conscious and unconscious, intuitive and rational, as well as its episodic memory.
Along human history the dominant cultures have controlled the inflow of Human Intellectual Activity in explicit and implicit ways, for instance by discouraging dissent. Internet allows us as users to dissent from any form of "established" HK and to influence, on an equal basis, the allegedly ideal HK. This feature will accelerate in an unprecedented way the enrichment of the ideal HK. For that reason, in FIRST we emphasize the mismatch between the HKM and users' thoughts, questions and expectations, oriented to satisfying users, that is, the human being as a whole and as a unit.
2- About a New Approach to Internet Communications
Linguistic Approach
We make specific reference to Internet Data Management because the "Big Net" differs substantially from most nets. Internet deals with all possible groups of people and all possible groups of interest. Internet users belong to all possible markets, from kids to old people, in all possible economic, social and political levels and cultures. This universality makes Internet man-machine interactions extremely varied.
On the contrary, in any other network we may define a "jargon", ethics and rules. When we build a new Internet Website we really do not know who our potential users will be, and consequently what they want and what they need; we even ignore their jargons. We imagine a target market and for that specific market we design the site content, in fact the "Information Offer" to that market.
The figure above depicts the matchmaking process within the Internet "noosphere". The users, in green, express what they want and even think in terms of "keywords" expressed in their own jargon; they are open and flexible. On the contrary, the Website owners, through their sites, believe they have the truth, only the truth and nothing but the truth. In that sense, whether or not they are authorities, they resemble "The Law" of the establishment of the Human Knowledge. The law, for each Major Subject, is expressed in Indexes of the main branches of that Major Subject, resembling a "Logical Tree", depicted in gray over the yellow truth. They imagine their sites as universal facilitators, but always following the pattern of the logical tree and expressed in their jargons.
The Websites have their own Thesaurus, a set of "official" keywords, depicted in white over a black background, within the darkness of the Web space. Between the logical tree and the Thesaurus there is a correspondence. The Website owners are shown with the Truth Staff in yellow. The users-Internet interactions are depicted as a progressive matchmaking process, going from green to black and vice versa, each side learning from the other through match and mismatch. Both sides strive to know each other, interchanging knowledge.
Paradoxically, even though the Web is so well suited to adding, generating and managing intelligence, most people ignore this fantastic possibility. If we define our Information Offer as WOO, which stands for What Owners Offer, and what the users want as WUW, which stands for What Users Want, the Web architecture permits the continuous match between them and, as a byproduct, the intelligence emerging from any mismatch.
That possibility means the following: WUW is what users want expressed in their specific jargon/s, while WOO is the Website information offer expressed in, let's say, the "official/legal" jargon, the one we choose to communicate with our target market. The continuous comparison of WOO versus WUW would permit us to know the following five crucial things:
· What the Market wants
· The Market's major characteristics
· The Market's homogeneity and/or its segmentation
· The Market's jargon/s
· The Market's needs.
Knowledge of the market jargon/s permits us to optimize our offer: for instance, a negative answer to a user query could mean either that we do not have what he/she wants or that the name of what he/she is looking for in his/her jargon differs from its name in our jargon.
What we know directly from users' queries is what they want, not what they need. The difference between WUW and WUN, What Users Need, is substantial. People generally know what they need but adjust their needs to the supposed or alleged Website capabilities. We learn what our users need as time passes, if we make use of the intelligence byproducts and/or of surveys.
The Information Offer is normally presented as ordered sets in the form of Catalogs, Indexed Lists and Indexes, but the queries, where the users express their particular needs WUN, are expressed by keywords. Both communication systems are completely different, even though they can complement each other, and we can make them work together towards the ideal match between WUN and WOO.
As we will soon see, the users communicate with the different Websites via their subjective jargons, at least as many jargons as the MS, "major subjects", they are interested in. For instance, if I am an entrepreneur who manufactures sport car wheels, I am going to query B2B sites for subjects related to sport car wheels expressing myself in "my" jargon, which differs from the "official" jargons used in the B2B sites, and of course the query outcomes will strongly depend on those jargon differences.
In a similar way as official languages change from time to time, influenced at large by the pressure of people's jargons, both coexisting at any time, we may endow the Websites of the Cyberspace with an extremely efficient evolutionary feature via Expert Systems that learn from the man-Internet interactions. We dare to qualify this feature as extremely efficient because in the Cyberspace every transaction can be easily and precisely accounted for. So, each time a user uses a keyword belonging to his/her jargon, this event can and should be accounted for.
Let's then imagine what kind of intelligent byproducts we could extract from this simple but astonishing feature. Within a homogeneous market the keywords tend to be the same among its members. So, in our last example, if the majority of users make queries asking for wheels and the word-product wheel does not exist in our database, a trivial byproduct takes the form of the following suggestion: add wheels to the database as soon as possible. On the other hand, if the word-product "ergaston" was never asked for during a considerable amount of time, another trivial message would be: take ergaston out of the database.
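A minimal sketch in Python of how those two trivial byproducts could be derived from counters like the ones sketched earlier; the thresholds and names are illustrative assumptions.

```python
def suggest_changes(keyword_hits, mismatch_hits,
                    add_threshold=50, stale_threshold=0):
    """Turn raw counters into editorial suggestions (illustrative thresholds)."""
    suggestions = []
    for kw, hits in mismatch_hits.items():
        if hits >= add_threshold:
            suggestions.append(f"add '{kw}' to the database as soon as possible")
    for kw, hits in keyword_hits.items():
        if hits <= stale_threshold:
            suggestions.append(f"consider taking '{kw}' out of the database")
    return suggestions

# e.g. suggest_changes({"ergaston": 0, "tires": 412}, {"wheels": 87})
# -> ["add 'wheels' to the database as soon as possible",
#     "consider taking 'ergaston' out of the database"]
```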
The figure above depicts the evolution of the matchmaking process. In the beginning, the Website owners had the oval green-gray target, where one user is shown as a black dot. But that user really belongs to a users' affinity market, depicted as a dark green oval, with a cone of Internet interest that differs too much from the ideal initial target. The Website owners need an intelligent process to shift towards the bigger dark green potential market. With a yellow cone border we depict the final "stable" matchmaking.
3- FIRST, Full Information Retrieval System Thesaurus
The Cyberspace actually holds about 1,500 million documents, ranging from reference to trivial, from true e-books dealing with the major subjects of the human knowledge to daily news and even minute-to-minute human interaction information, as in the case of Newsgroups, Chat and Forum "on the fly" page generation. This information mass grows continuously at an exponential rate, rather chaotically, as its production rate far exceeds the human capacity for filtering, qualifying and classifying it.
To help retrieve information from the Cyberspace we make use of Search Engines and Directories, which are unable to attain WUN, What Users (We the Humans) Need. From all that information mass the search engines offer us "summaries", telling what kind of information we can get at each location of the Cyberspace (the URL, Uniform Resource Locator). So for each URL we as users obtain its summary. Those summaries are normally written by the Search Engines' robots, which try to do their best extracting pieces of "intelligence" from each Cyberspace location.
In the figure we depict some sites within the darkness of the Cyberspace. We may find anything from huge sites storing millions of documents with hundreds of sections to tiny sites with a flat design storing a few pages. A Search Engine, shown as a yellow crown, sends its robots to visit the existing sites from time to time, making a brief "robotic" summary of them. As we will soon see, those brief reports are noisy, deceiving the users (green circle). The Search Engine assigns priorities, which act in turn as a measure of the site's magnitude (like the brightness of a star). As depicted, the priorities (the navy blue dots) have nothing to do with the real magnitude of the site (depicted as the white circle diameter). So the yellow crown is a severe distortion of the Web. These priorities, defined for the keyword set of a given site, resemble the "light" that illuminates it: a high priority means a powerful beam of light reflecting over the site, highlighting it to the users' sight.
The actual information provided by the search engines is as primitive as the map of the sky we had one thousand years ago. The robots only detect some keywords the site content has, equivalent to the chemical elements of the celestial bodies, but they tell us nothing about its structure, type of body and magnitude. Today we may have for each celestial body the following data, among many others: diameter, density, the spectral distribution of its constitutive elements, brightness, radiation and albedo. For each of these variables we have site equivalents that must be known in order to say that we have a comprehensible Cyberspace map. For instance, we need to know something that resembles magnitude, density, element distribution and brightness.
The bodies of this cultural and intellectual space (noosphere) being intellectual creatures, we need an intellectual summary of each of them, what is known as the abstract in essays and research papers. For instance, a site could be camouflaged to appear attractive, emphasizing the importance of a given element, let's say "climate", to deceive a robot into classifying it as a specialized climate site while in reality it has no climate content at all. The same happens with information: Portal news, for instance, are presented as content sites, which is true only concerning a specific type of information resource known as "news", with an extremely ephemeral life of hours. On the contrary, philosophy or mathematics content is by far denser, heavier, with lives lasting centuries on average. So we could distinguish all kinds of bodies, from fizzy (news) to rocky (academic).
Another complementary source of information are the Databases hosted alongside the Websites, huge stores of organized and structured data. The content and quality of these databases are normally a subjective "bona fide" declaration made by the Website owners. So far, for the users, the Cyberspace, particularly the Web Cyberspace, looks like a net of information resources with some "Indexes" to facilitate their retrieval task. Those robot-made indexes are so noisy as to be practically useless. Below we attach a well-known graphic sample of this uselessness.
type="#_x0000_t75" style='width:261.75pt;height:217.5pt'>
The figure depicts the finding of useful information (black spots) while navigating along a searching program.
The main
reasons are, among many:
Increasing Website Complexity: Robots cannot cope with the increasing complexity of Websites. They are unable to evaluate properly sites like those belonging to NASA, the World Trade Organization and the Library of Congress, to mention only some institutional examples concerning Aerospace, Commerce and General Knowledge respectively, and cannot differentiate them from trivial sites dealing with similar subjects.
Inability to cope with Human Stratagems: Robots are unable to detect and block some subtle overselling stratagems used by the Website owners to position themselves high in the Search Engines' answers to users' queries.
Linguistic Problems: Robots cannot cope with the increasing number and complexity of the languages and jargons used on the net. They do their work using rather naïve Thesauri, modified and enriched only via the Website owners' declarations, not, as it should be, via the users' feedback. As a consequence of that bias, the Search Engines speak the owners' jargons instead of the users' jargons.
In brief, the shadows of content that the search engines offer the users have almost nothing to do with the real content of the Cyberspace, presenting a distorted vision of it. The problem is the contagious spread of this distortion, as long as the Website owners use that summary information as a "bona fide" vision of their world. As a corollary, Internet today speaks the Website owners' jargons, pointing to a globally distorted vision of the real markets!
Uselessness Measure
The measure of the mismatch between WUN, What Users Need, and WSO, What Search-Engines Offer, should be one of the first priorities of scientific institutions interested in Internet health. However, almost everybody is well acquainted with this abysmal mismatch, and you may check it by yourself very easily by making random queries about any subject. We, as a private research group, made our own investigation of that global mismatch, finding the following figures:
Mismatch of WSO versus WUN is
within the order of 6,000 to 1
Meaning that we, as ordinary users, searching through the Cyberspace with the help of outstanding search engines, have on average to browse through 6,000 summaries to find 1 that potentially matches our needs.
Searching for information stored in Databases proved to be a tough task as well. Students of Systems Engineering in the last year of their degree at the Instituto Tecnológico de Monterrey, Mexico, were invited to freely query a tested (2) commercial Database; the mismatch was greater than 99.9%, that is, they needed on average more than 100 queries to match a product/service stored within the database. The main reason for the mismatch was not missing information in the database but linguistic problems. That was a warning sign, and we investigated some other commercial databases belonging to well-known B2B sites, with similar results.
Note 2: By "tested" we mean that the content was checked before the trial. The information existed, but the students were unable to find what they were searching for because of linguistic problems.
The abysmal and chaotic mismatch enables forms of e-Commerce delinquency: when you as a user face that finding, your first reaction could be to become suspicious of the declared content of the database. On the owners' side, they could allege that those mismatches are due to the linguistic ignorance of the users. Unfortunately there is not yet anything like an official audit to detect deceptions, but we have found many databases that are really empty, betting on growth via user membership with cynical declarations such as: "Come join us! We already have one million firms like yours!"
Our methodology started as an effort to solve some Internet drawbacks that Website owners and users experienced, mainly within the dot-COM domain. Concerning that, our Systems Engineering background warned us, and we were aware, that the crisis was the "Internet answer as a system" to the wrong approaches of most Internet newcomers. At large, Internet is a net of computers and servers obeying the rules of IT and Communications. What happened over the last two years within the dot-COM domain must have looked like science fiction to traditional IT and Communications companies. But finally the waters will find their natural courses.
Along that reasoning we were confident that the solutions to some of the Internet drawbacks would be found within classical systems engineering wisdom. Within that wisdom were classical concepts like Information Retrieval Systems, Selective Dissemination of Information and Expert Systems. Firms like IBM have a long history with those milestones. As far as I can remember, KWIC (Keywords In Context), SDI (Selective Dissemination of Information), more recently the Taper Web semantic methodology, and Deep Blue, which beat Kasparov, all run along these lines of research.
The first two were respectively a tool and a methodology to retrieve and disseminate information efficiently, taking into account the different "jargons" of the Information Offer and of the Information Demand, the latter belonging to the users' realm. That was a subtle differentiation that defies the passage of time. In fact, Internet is, among many other things, an open World Market that tries to captivate as many people as possible, speaking different tongues and different jargons.
A jargon is a practical subset of a language used to communicate among people, for instance between buyers and sellers, but it takes many years to reach a tacit agreement concerning definitions. For instance, the equivalents of "tires" in Spanish could be neumáticos, gomas, cubiertas, ruedas and hules, with an agreement to consider only neumáticos as the formal equivalent of "tires" and the rest as synonyms.
The mismatch between offer and demand could be depicted as follows:
WOO ⇔ WUN
Which stands for the match/mismatch between WOO, What Owners Offer, and WUN, What Users Need. Internet will be commercially useful as long as WOO approaches as much as possible the always-changing WUN.
Let's advance a little on the user side. We may differentiate among the following user satisfaction levels:
WUN > WUW > WUS > WUG > WUL
Where:
WUW stands for What Users Want, generally restricted by users' expectations about the full capability of the Offer;
WUS stands for What Users Search, restricted by the explicit/intuited site limitations;
WUG stands for What Users Get;
WUL stands for What Users Lose, in terms of potentially available information.
So, being submerged in the mismatch, we must learn as much as possible from it! Information Theory tells us that mismatching delivers far more information about the "other side" than matching, in our case information about the markets. Studying the mismatch carefully, we could attain a convergent solution to our mismatch problem as well.
In order to accomplish that aim we need systems that learn as much as possible from mismatching. With this idea in mind, the whole problem can be stated as follows:
If our first offer to the market is WOO_1, we must find a convergent process such that
WOO_1 - WUN_1 > WOO_2 - WUN_2 > ... > WOO_i - WUN_i > ...
where the inequalities converge to zero, exponentially if possible. That is what an Expert System does, provided we can find a reasonable first approach to the market needs, WOO_1, the first iteration of a continuous evolutionary process. We were talking about learning, but we still have to define what we are going to learn from. We are going to learn from the users ⇔ Websites interactions. Additionally, we must create a methodology and programs able to interpret what the minus sign (-) means in those inequalities and how we step up from iteration to iteration.
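A minimal sketch in Python of one possible interpretation of that minus sign: measure the mismatch of iteration i as the fraction of user queries the current offer cannot answer, and check that the sequence of mismatches decreases. The metric chosen here is an illustrative assumption, not the actual FIRST algorithm.

```python
def mismatch(offer_keywords, user_queries):
    """Fraction of user queries not covered by the current offer (one reading of WOO_i - WUN_i)."""
    misses = sum(1 for q in user_queries if q not in offer_keywords)
    return misses / len(user_queries) if user_queries else 0.0

def is_converging(mismatch_history, eps=1e-6):
    """True if the per-iteration mismatch is strictly decreasing towards zero."""
    return all(a > b + eps for a, b in zip(mismatch_history, mismatch_history[1:]))

# e.g. is_converging([0.40, 0.22, 0.11, 0.05]) -> True
```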
No, definitely not! The search engines are extremely useful, and this will remain so in the future. We are going to need search engines that cover the whole Cyberspace, as a virtual summary of the Noosphere (3), or the World Sphere of the Human Knowledge. These World Summaries Databases will be, as now, the best Indexes of the Human Knowledge on Internet, not appropriate for direct use by ordinary users but rather for Website Engineers and Architects.
Note 3: the sphere of human
consciousness and mental activity especially in regard to its influence on the
biosphere and in relation to evolution
For each major subject of the Human Knowledge we are going to need specialized Websites with almost 100% proprietary content, where ordinary users looking for subjects within a given major subject will be able to navigate in an "only one click, YGWYW, You Get What You Want" scenario. That is, they will find exactly what they are looking for in only one click of their mouse. To accomplish that, the Content Engineers must provide for each major subject a satisfactory initial information offer, WOO_1. And we have to ask ourselves: where are we going to get those initial content locations from? The answer is trivial: from the search engine databases.
Once we implement this satisfactory initial offer, our FIRST methodology, via its Expert System, will start to learn from mismatching, adjusting the site offer to the user needs and querying the Search Engines' databases only by exception, when new content is needed. The exceptions are triggered by non-satisfied users' demand. We will see next how to create intelligent summaries and how we can obtain a progressive independence from the Search Engines.
To start a convergent process approaching our real target we need a reasonably good starting WOO_1. To accomplish that we designed a three-step search methodology, depicted in our section devoted to how to create i-URL Databases, Intelligent URL Databases. To understand the global methodology it is only necessary to accept that WOO_1 is equivalent to our first VL, Virtual Library, that is, our first credible Index of links pointing to a set of basic e-books and documents representing our best initial approach to a given major subject.
Let's suppose we were dealing with a Veterinary Portal addressed to professionals. Our first VL will have from 1,000 to 1,500 links pointing to the basic e-books (most of them authorities) and documents with the "necessary and sufficient" information veterinary professionals will presumably need.
The task of building WOO_1 is heavy and must be either performed or controlled by experts in the given major subject. Our strong hypothesis is that practical human knowledge can always be packed into a finite volume of e-books and documents, ranging from 500 to 3,000, dealing with the basic subjects of the major subject. Concerning that, you may verify by yourself whether you can imagine a specialized physical library with more than 3,000 different books! Even within the academic context it is hard to find specialized library sectors with more than that.
Another fact to take into account in order to proceed with the understanding of the global methodology is that, once a given major subject is considered an established discipline, it is classified following a hierarchy like a tree, with subjects, sub-subjects, sub-sub-subjects and so on. That is the way we humans communicate among ourselves; that is the Law for that particular discipline, the established path to learn it and to be certified as a professional as well.
On the contrary, we humans, as users of a given discipline, are evolutionary beings: we change, we improve, and sometimes we go beyond the boundaries of our actual discipline. Concerning VLs, we are prone to query them not by subjects but via keywords. So for each discipline, for each niche of the human knowledge, we may define an Index and a set of keywords, both expressed as a jargon. The index and the content of the corresponding e-books and documents are expressed using the keywords of the set. If the index is analytic enough, all the keywords of the set will be used at least once.
Then we have a rather rigid Index, resembling the "Law" for a particular branch of human knowledge, and a set of keywords. The keyword set is a living thing: some keywords become either less or more important as time passes, and some of them may even disappear from the users' jargon. New keywords are created as well and, at large, if they are used consistently, they must be incorporated. Finally, the keyword evolution must suggest changes in the "old Law" as well. With all these elements in mind we may step to the core of our global methodology.
WOO_1 ⇔ VL_1 ⇔ [I,K]_1
Where [I,K]_1 is the first pair (Index, Keywords), namely the initial Index presented by the site with all available documents indexed by the initial keyword set. The keyword set has in turn three components:
K = (Ko, Ks, Kr)
Where Ko stands for the "official" keywords, for instance the standard terms describing a particular product or service, Ks stands for all the possible and accepted synonyms, and Kr stands for the related keywords, defined to help the users' search.
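A minimal sketch in Python of the keyword set K = (Ko, Ks, Kr) and of its most basic use, normalizing a user's term to an official keyword; the class layout is an illustrative assumption, and the Spanish tire synonyms anticipate the example given later in the text.

```python
from dataclasses import dataclass, field

@dataclass
class KeywordSet:
    official: set = field(default_factory=set)    # Ko: "official" keywords
    synonyms: dict = field(default_factory=dict)  # Ks: synonym -> official keyword
    related: dict = field(default_factory=dict)   # Kr: related keyword -> official keywords

    def normalize(self, term: str):
        """Map a user's term to the official keyword, if the jargon knows it."""
        if term in self.official:
            return term
        return self.synonyms.get(term)  # None signals a mismatch to be counted

# Illustrative content, anticipating the Spanish "tires" example below
K = KeywordSet(
    official={"neumaticos"},
    synonyms={"gomas": "neumaticos", "cubiertas": "neumaticos",
              "ruedas": "neumaticos", "hules": "neumaticos"},
)
```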
Initially the jargon of the site will be the owners' jargon, or the best linguistic approximation made by the owners to interpret the market, and progressively the site jargon will approach the real market jargon.
The Thesaurus is K plus the corresponding keyword definitions. For Merriam-Webster a Thesaurus is:
"a: a book of words or of information about a particular field or set of concepts; especially: a book of words and their synonyms. b: a list of subject headings or descriptors usually with a cross-reference system for use in the organization of a collection of documents for reference and retrieval."
The power of the right statistics
As each type of information has a given statistical life (4), it is very important to dose it wisely in order to keep the maintenance cost low. To offer an optimal information portfolio we must know the users' preferences in detail. Classical statistics tell the Website owners how their users browse their information resources in terms of "path" statistics. What we offer goes a little further: what the users go to a particular path for.
Let's suppose that users go frequently to a given path because its title suggests too many things. People go there and find nothing. How do we detect the natural deception? On the other side we may have solitary paths with powerful and useful content for the users, but whose titles suggest nothing. We must realize that users think in terms of keywords in their own jargons, so we must orient our offer in the same direction. Our i-URL Databases are designed with that in mind. Each document has an editorial brief telling what the site is and what the site offers, using a proprietary taxonomy system, a set of keywords and a set of i-Tags, Intelligent Tags, registering its whole life. For FIRST each query deserves maximum attention, accounting for each type of user reaction, namely: ignoring it; browsing along the list; clicking on at least one link; communicating with the Webmaster (enabled in each query); etc. And, once a user has selected a summary, the system accounts for whether or not the user selects the summarized document.
Note 4: We were talking about "life", and effectively each piece of information has a given life, following exponential decay of the elementary type e^(-λt), where 1/λ is the mean life of the information.
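A minimal worked example in Python of the decay law in Note 4: a piece of information with mean life 1/λ keeps a fraction e^(-λt) of its relevance after a time t; the 30-day mean life used here is purely illustrative.

```python
import math

def remaining_relevance(t_days: float, mean_life_days: float) -> float:
    """Expected relevance of a piece of information after t days,
    assuming exponential decay with mean life 1/lambda."""
    lam = 1.0 / mean_life_days
    return math.exp(-lam * t_days)

# A piece of information with a mean life of 30 days keeps ~37% of its
# relevance after 30 days and ~8% after 75 days.
print(remaining_relevance(30, 30))   # ~0.368
print(remaining_relevance(75, 30))   # ~0.082
```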
i-URL’s Databases
i-URL stands for an Intelligent Comment about a given Website located at the URL address. Everybody knows what those comments look like when delivered by search engines, but everybody also knows how frequently useless they are!
In fact, you may spend hours looking for something useful, even being an expert Web navigator. Some confidential estimates about this unfruitful and heavy task tell of efficiencies below 1:5,000, meaning that to find what we are looking for (the 1) we have to browse through at least 5,000 of those comments, on average. Concerning databases we talk about query efficiency, namely how many queries on average we have to perform in order to find exactly what we are looking for. That efficiency, found in commercial databases (1), was extremely low: less than 0.1%!
That general inefficiency is one of the big problems Internet has to overcome in the near future. We are not going to discuss the reasons for this inefficiency here, except to say that it is mainly due to the Website owners' lack of responsibility. Most people do not respect the netiquette, the Internet etiquette rules: they lie, exaggerate their sites' worth and try to deceive navigators and robots; in fact, they try to oversell themselves through their Websites.
To make things worse, the search engines simplify the process too much, adding their proprietary noise to the site owners' noise, resulting in a squared noisy media, that is, a power-two noisy environment.
One first step is then to build databases with professional and "true" comments. For a given major subject, for instance "women's health", the first milestone should be to have a credible document database concerning that specific subject. In that case we have to ask ourselves: how many basic documents does that database have to hold in order to deserve the title of "Virtual Library"? The exact answer is almost impossible to give, but we can talk about boundaries instead. When we talk about a library we mean a collection of books, and in this case we have to locate a sort of e-books, Websites resembling classical books. Turning then to boundaries, we may talk about a library with a volume ranging from 2,000 to 4,000 books (2) and, in our case of a Virtual Library, the locations and clever summaries of an equal number of Websites resembling e-books. That is not too much indeed (3), talking now in terms of Cyberspace!
We then have to ask ourselves the next two questions: can we find those kinds of e-books on the Web? Is it possible to select that specific library efficiently out of the Web? The answer is yes in both cases for most of the major subjects of our human activity.
Now our problem is bounded to locating those crucial Websites efficiently. However, once we have located them we have to face another problem: how to search fast and efficiently within a Virtual Library of, let's say, 3,000 Websites ⇔ e-books, complemented by an Auxiliary Library of 10,000 to 100,000 technical and scientific documents (Reviews, Journals, Proceedings, Communications, etc.).
The problem could be stated as follows: how can we efficiently build an efficient Virtual Library? Let's face the second part first: how to build efficient Virtual Libraries? Let's suppose we have to design a Cancer Virtual Library (Altavista found 3,709,165 pages as the search outcome for "cancer" at 6:00 PM on 03-07-01). Of course, in our Virtual Library we are not going to search among more than 3 million Websites but only among 3,000, but that number is still big enough in terms of searching time.
Let's imagine ourselves within a real library with those 3,000 books filling the space of three walls from ceiling to floor. If we are interested in finding all the literature available for a specific query, surely we are going to need some indexing system to locate all the books dealing, to some extent and depth, with the question asked, and to review them afterwards. Even with an adequate indexing system and a file of "intelligent summaries" of all the books, we will spend a couple of hours selecting the set of books supposedly covering the whole spectrum of the query.
Fine!. We are getting to the point of discovering a betterment
methodology to design an efficient e-library
- Select
the basic 1,000 to 3,000 Websitesóe-books;
- Design
an indexing system with an intelligent summary (i-Comment) of each e-book
depicting the main subjects dealt within.
Keywords versus
subjects
The summaries must be true, objective and cover all the matters dealt with in their corresponding e-books. To be true and objective we only need adequately trained professionals. To cover all the matters, the trained professionals must browse the whole e-book and know what "matters" means.
With that interpretation we introduce some subtle details derived from our searching experience. People really look for "keywords", that is, meaningful words and sequences of words that trigger our memory and our awareness. Many of these keywords become knowledge items within a hierarchy of concepts for a given major subject. But keywords are important to us depending on the circumstances, not because of their hierarchical importance within a given major subject.
When an author makes the index of his book he thinks in terms of rationality and as a member of society, respecting the established order. The index resembles a conventional, sequential, step-by-step recommended teaching and learning procedure. On the contrary, whoever is searching makes queries looking for what he/she needs as a function of the circumstances. The index resembles the Law.
So the Thesaurus, which collects all the possible keywords of a given discipline, is not a hierarchical logical "tree". Each keyword is generally associated with many others within the Thesaurus as a transient closed system, and sometimes a bunch of them can be matched to specific items of a logical tree structure. The Thesaurus is the maximum possible order within the chaos of the circumstances.
The logical trees, all the indexes we could imagine, are only "statistical" and conventional rules at a given moment of knowledge. Knowledge, along its evolutionary process, takes the form of a subjective Thesaurus, because each person has his/her own Thesaurus for each major subject of interest.
The Law and the Circumstances
Notwithstanding, we can make both concepts work together for the sake of searching efficiency. The logical structures are good as starting procedures, in the learning stages. Besides, as the trees come out of statistics, the use of a given Thesaurus can give rise to new and more up-to-date logical tree indexes via man-machine interaction along an evolutionary process. The indexes alone are too rigid and become obsolete easily.
Now we can enter the core of our new methodology to build Intelligent Virtual Libraries, made of what we titled i-URL's, in the sense that each URL hosts a basic e-book, a crucial document, a hub or an authority.
e-Thesaurus: a collection of all known keywords (at a given moment in a given place, for instance today in the Website www.xyz.com), eminently a subjective cyber concept.
i-URL's: i-Comments of basic Website e-books, with the significant keywords dealt with within the Website, plus an aggregate of i-tags, intelligent tags defining its morphology, the properties of the Web space body.
i-URL's Database: the database of all the i-Comments of all the Website e-books that define the Virtual Library (at a given moment in a given place, for instance today in the Website www.xyz.com).
Virtual Library Index: the indicative index of the i-URL's Database content; it is the index that appears as the "by default" Menu to orient an ordered browsing of the Virtual Library. As a matter of knowledge it is only valid for the people that interact with the Virtual Library as a market-as-a-whole. This index is adequate to orient learning rather than searching. It should be updated from time to time as suggested by the i-URL Virtual Library Integrator of the Expert System (4).
What an i-URL looks like
[Figure: schematic of an i-URL within the Map of the Human Knowledge]
In the figure above, the yellow dot represents a reference site for a given Major Subject of the Human Knowledge, let's say Personal Financing. The dark green dot within the green Users' region represents a set of users interested in that Major Subject, let's say the target market. Represented as a gray crown is the Map of the Human Knowledge, at present nonexistent. A group of people interested in capturing this potential market decides to build a reference site about it, let's say a Personal Financing Portal. So, first of all, they need something equivalent to the Personal Financing sector of a Virtual Library of e-books, or reference sites actually existing on the Internet. To accomplish that, they proceed along the steps described at the beginning of this document.
The i-URL Septuplet: for each reference site they create an i-URL as an information septuplet, as follows (a minimal data sketch is given after the list):
1. i-URL: http://www.major_subject037.com, that is, an e-book dealing with subject 037 of Personal Financing, for instance "Financial Resources". A Human Expert, where possible with a major in Financing and specially trained to evaluate any type of site pattern, reviews this site. See "Site quanta" and our chapter about Taxonomy of Websites.
2. Subject – Logical Tree: in this case "Financial Resources", one of the branches of the logical tree initially loaded into the system as a first approach to How a Personal Financing Portal Should Be.
3. Strategic Information: all kinds of "coordinates" of the site and of the evaluation done: dates, site origins (for example country, organizations it belongs to), evaluator references (the human being doing the evaluation), languages, jargons, etc.
4. Site quanta: all data about the structure of the site: type, importance, "size", "deepness", "width", design features, architecture features, etc. See our section about Taxonomy of Websites.
5. Human Comment: the core of the site evaluation, written by an expert once he/she has reviewed the site. It may contain some other site references (links), shown as yellow lines, and must be expressed as much as possible using keywords (shown as green dots) of the Thesaurus at hand.
6. Keywords: a set of the most significant keywords that depict the site from the users' point of view, left to the personal criteria of the evaluators. Some keywords may not even be literally present in the site, but the evaluator considers that the site deals with them anyway. The system is engineered to count how many times the i-URL was referenced by each specific keyword.
7. Statistic Counters: we have defined three types of counters: presence counters, "a priori" interest counters and confirmed interest counters. Presence counters count how many times the i-URL was served by the system in order to satisfy potential users' needs; a priori interest counters count how many times this specific i-URL (its i-Comment) was requested in full; and confirmed counters count how many times the users requested the referenced site itself.
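To fix ideas, here is a minimal sketch, in Python, of how one i-URL septuplet could be held as a record; the field names and the example values are ours, purely illustrative, and do not pretend to be the project's actual schema.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class IUrl:
    """Illustrative record for one i-URL septuplet (field names are assumptions)."""
    url: str                        # 1. the referenced e-book
    subject_path: List[str]         # 2. branch of the logical tree
    strategic_info: Dict[str, str]  # 3. dates, country, evaluator, language, jargon...
    site_quanta: Dict[str, str]     # 4. type, size, deepness, width, design, architecture...
    human_comment: str              # 5. the expert's i-Comment, written with Thesaurus keywords
    keywords: List[str]             # 6. most significant keywords chosen by the evaluator
    presence_count: int = 0         # 7a. times the i-URL was served in an answer list
    a_priori_count: int = 0         # 7b. times its i-Comment was requested in full
    confirmed_count: int = 0        # 7c. times the user went out to the referenced site

example = IUrl(
    url="http://www.major_subject037.com",
    subject_path=["Personal Financing", "Financial Resources"],
    strategic_info={"country": "US", "evaluator": "expert-01", "language": "en"},
    site_quanta={"type": "portal", "deepness": "4"},
    human_comment="An e-book style survey of financial resources ...",
    keywords=["loans", "credit", "savings"],
)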
How do we get the Initial Virtual Library
Fine! We have defined what an efficient Virtual Library about a specific major subject means. It is straightforwardly conceivable that this system works, but a problem still remains.
As we build Expert Systems that learn from man-machine interactions with users, our main problem is how we get our first i-URL's Database, how we locate the first 3,000 e-books. Once this problem is solved, the Expert System will improve and tune up the Virtual Library along an evolutionary path.
This is a typical chicken-and-egg problem: what comes first, an initial Thesaurus or an initial Subjects Index? As each one brings about and positively feeds back the other, it does not matter how we start. For instance, we may start with an initial index provided by some expert as our seed. From this initial index we may select the first keywords to start our searching process or, on the contrary, we may start with an arbitrary collection of keywords as our seed, also provided by an expert. In any case we must behave as head hunters trying to catch our first e-book, let's say the first full-content Website authority concerning our major subject.
This first candidate to become an e-book will provide us either a subject index or a tool to improve our initial Thesaurus. Surely within this Website we will find more reference links that will widen our panorama, driving us to find better Websites, complementary sites or both. This is a sort of scientific-artisan methodology, well suited to deepening our knowledge about something with no precise rules but general criteria. We will see that for all these tasks we design specialized Intelligent Agents to act as general utilities that make the process efficient.
One criterion is to try to fill all the items covered by the best index we have at hand at any moment. That is, we investigate each milestone e-book as much as we can until the dominant items dealt with are fully covered, and then we continue looking for more e-books that cover the remaining items of the index until full coverage has been attained (a greedy sketch of this coverage criterion follows).
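As a hedged illustration of this coverage criterion, the following Python sketch selects e-books greedily until the index items are covered; the data structures (an index as a set of items, candidate e-books as the sets of items they cover) are assumptions made only for the example.

from typing import Dict, List, Set

def cover_index(index_items: Set[str], candidates: Dict[str, Set[str]]) -> List[str]:
    """Greedy sketch: keep picking the e-book that covers the most still-uncovered items."""
    remaining = set(index_items)
    chosen: List[str] = []
    while remaining:
        best = max(candidates, key=lambda url: len(candidates[url] & remaining), default=None)
        if best is None or not candidates[best] & remaining:
            break  # no candidate covers anything that is still missing
        chosen.append(best)
        remaining -= candidates.pop(best)
    return chosen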
To accomplish this task we need searching experts with a high cultural level, trained to switch fast from intuition to a rational context and vice versa and, within rational tracks, able to switch fast between deductive and inductive processes as well.
First Round of Integration: once we have built a Thesaurus covering all the items of the first index (this index has probably evolved along the search, with new items and amendments), we must begin the integration of the basic e-books, pivoting on the milestone e-books complemented with new searches using the most "popular" keywords (the ones that have more milestones indexed). The "exploration" of the milestones' neighborhood is accomplished at high speed, via pure intuition, along a process we titled the "first round". To select Websites in this first round we follow a "nouveau riche" criterion: if the Website looks nice for our purpose, we select it. To give some facts and figures, we are talking about 30 to 40 milestones, mainly authorities and hubs, and from each milestone we select 100 to 200 Websites, totaling 6,000 to 8,000 Websites as the outcome of the first round. This first round works over a first raw selection made via infobots that query and gather Websites taken from search engines, so the human experts really work over a rather small universe.
Second Round of Integration: once this "redundant" Virtual Library is built, we must tune it up, keeping the 1,500 to 2,000 sites best suited to our purposes; that will be the e-book collection of our initial Virtual Library. To select them we use a logical template, screening the most important Website attributes, such as type, traffic, design, Internet niche, universality, bandwidth, deepness, etc. See in our section about Hints how we check the database completeness and redundancy via Intelligent Agents.
With this template we proceed to build our i-URL's, that is, the intelligent summaries of the e-books of our initial Virtual Library. We must emphasize here that the e-books remain in their original URL locations: the only data we record in our i-URL's Database are the i-URL's.
Versus non-intelligent Virtual Libraries and versus classical Search Engines
This is a rather sophisticated and heavy "only once" task, but the advantages are enormous compared to the use of the classical search engines (5):
- We build proprietary Virtual Libraries versus copy-and-paste, non-intelligent Virtual Libraries;
- With a probability near 100%, and absolutely under control, general users are going to find what they want, making true our assertion YGWYW, You Get What You Want;
- We build a system that evolves positively as time passes, with noise tending to zero, auto-generating a YGWYN scenario, You Get What You Need;
- Our Virtual Libraries generate intelligence, mainly from user interactivity. WYN, What You Need, and WYW, What You Want, are continuously matched against WWO, What We Offer, providing marketing intelligence to the site owners;
- Universality: our i-concept is extensible to all types of documents. With an Expert System of this nature we may homogenize Web URL's, proprietary documents, man-machine interactions (queries, chats, forums, e-mail, mailing lists, newsgroups, personal and commercial transactions) and news.
Notes
Note 1: along this line we made a joint research study with the Mexican university Instituto Tecnológico de Monterrey, analyzing e-Commerce database efficiency, with the following astonishing results: a group of students of the Systems Engineering career queried an industrial database of 200,000 Latin American firms. They were trained in how to search by keywords, for instance by product, and the positive matches were lower than 0.1%!
Note 2: we are talking about basic books. Of course this information basement must be complemented with thousands of technical and scientific publications as well.
Note 3: considering Web documents only, we are talking about one and a half billion documents, and we still have to consider the other Internet resources such as newsgroups and the millions of "pages on the fly" generated in chats and forums.
Note 4: all our Expert Systems work under the control of a Virtual Integrator, which integrates the Expert System with all kinds of system extensions such as front-ends, back-ends, Intranet, Extranet, etc.
Note 5: to remedy the search engines' inefficiency, some sites decide to build proprietary content, that is, a collection of critical documents trying to answer a reasonable FAQ. This is extremely useful and necessary and we recommend it, but it is not enough. Effectively, the sum of real knowledge dispersed in Cyberspace is so big for any major subject that any particular effort is like a drop of water in the ocean. Of course we may strategically design our "drop of water" in order to demonstrate that we are alive as referents and not mere passive Internet mediators.
5- Some Program Analysis Considerations
1- Thesaurus evolution, keyword popularity and something more
[Figure: a typical user track through the system]
The figure above depicts a typical user track. We may define in each track the following significant events:
· Enter: a new user enters a query, asking for a given keyword within a given subject (optional).
· c: a positive HKM database answer to a query; in the figure k1, k3, …, kn have positive answers while k2 does not.
· C: the user decides to retrieve one of the basic documents of the Web space catalogued as belonging to the Universal HK Virtual Library. This is a crucial instance of the tracking: effectively, the user abandons the site to dive into the outside document.
· Error: another crucial instance: the corresponding keyword (in the figure k2) leads to an error; supposedly the referenced site is no longer hosted at that URL address.
· Leave: the user leaves the system, but could still make a…
· Re-entry: the user re-enters the system, very important from the point of view of HKM usage, for some other keyword string within the same subject or for a different one.
· Subject: the user is emphatically invited to report a subject, apart from keywords; however he/she is not obliged to provide it.
· r: another crucial instance: the user statistically decides either to return to the system or to continue browsing the Web space by his/her own means.
· Main Subjects' Tutorials: eventually, FIRST offers users a set of tutorials where the main subjects of each Major Subject of the HKM are thoroughly explained.
Warning: we are talking about existing keywords, that is, the users query the HKM by existing keywords. Perhaps the most crucial event occurs whenever a non-existent keyword is queried, provided it is correctly written. FIRST must investigate several things in this case: a) test whether the keyword is non-existent within the specific main subject but present in the HKM database for the queried Major Subject; b) test whether the keyword is non-existent in the HKM database for the queried Major Subject but present in some other; c) test whether it is absolutely out of the HKM.
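A minimal sketch of these three tests, assuming the HKM is held as a nested mapping Major Subject -> main subject -> set of keywords (an assumption of ours, not the actual FIRST data model):

def classify_missing_keyword(keyword, main_subject, major_subject, hkm):
    """Return which of the cases a), b) or c) applies to an unrecognized keyword."""
    in_queried_major = any(
        keyword in kws
        for subj, kws in hkm.get(major_subject, {}).items() if subj != main_subject
    )
    if in_queried_major:
        return "a) absent from this main subject, present elsewhere in the queried Major Subject"
    in_other_major = any(
        keyword in kws
        for ms, subjects in hkm.items() if ms != major_subject
        for kws in subjects.values()
    )
    if in_other_major:
        return "b) absent from the queried Major Subject, present in some other"
    return "c) absolutely out of the HKM"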
See below the different groups of keywords. FIRST must analyze the existence or non-existence of unrecognized keywords for all those groups. The FIRST Chief Editor must carefully review these cases once they are properly reported.
We can improve our insight by going deeper into each incident, namely:
Over c: once a couple [keyword, subject] is keyed and properly checked against all the types of consistency programmed, FIRST answers with a hierarchical list of either the selected i-URL's or their corresponding briefs. The latter procedure invites the user to mark the most appropriate one with a click. The user can even navigate within the same list, that is, within the same couple.
Over error: eventually the user could get a wrong URL address (however, this kind of error must be avoided as much as possible). The system must make the most of these opportunities, trying to offer the user some alternatives: similar URL's (once it has been checked that the link works properly!) and/or advice to consult related tutorials within the system. Independently, these events must trigger one of the searching intelligent agents, either to locate where the URL could have migrated (the most probable condition) or, in an extreme case, to look for new documents. The potential documents to replace the lost one must be sent to the FIRST Chief Editor, who finally approves or disapproves the new document once the corresponding i-URL is edited. Once finally approved, the announcement of the new document must be emailed to the users who previously asked the system to be warned.
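A minimal sketch, using only the Python standard library, of the first half of this handling (verifying that proposed alternative links actually answer, and queuing the incident for the Chief Editor); the queue format is an assumption of ours, for illustration only.

import urllib.request

def link_alive(url, timeout=5.0):
    """A HEAD-style probe: True if the URL still answers, False on any network error."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except (OSError, ValueError):
        return False

def handle_error_event(dead_url, similar_urls, editor_queue):
    """Offer only alternatives verified to be alive; queue the lost reference for review."""
    alternatives = [u for u in similar_urls if link_alive(u)]
    editor_queue.append({"lost": dead_url, "proposed": alternatives})
    return alternatives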
Note: an internal clock measures the session duration for each user: once the user has gone out to review something, the system waits a reasonable time and still treats the returning user as working within the same session. A user may change subjects along one session.
Possible strings are:
[k1, c, C, k2, k3, c, k4, c, C, leave] subject i
[k1, k2, c, c, c, k3, c, C, c, C, k4, k5, leave] subject j
In the first string, for subject i, the user decided to click on a URL once he had reviewed its i-URL, then returned, searched for k2 and k3 but just peeped without being interested in reading the lists of i-URL's provided, then tried k4, clicked on another URL and finally left the system.
In the second string, for subject j, the user sweeps over k1 but reviews the k2 list extensively, makes two more retrievals with k3, then sweeps over k4 and k5 and finally leaves the system.
As our purpose is to keep only keyword strings, those strings can be summarized as follows:
[k1, k2, k3, k4] subject i
[k1, k2, k3, k4, k5] subject j
(In the original figure the keywords are shaded from a cold color, blue, to a very hot and active one, red.) For each session and for each subject the keyword strings are saved for statistical purposes. Statistics are made by string, both as they are and alphabetically. A tiny filter doing this summarization is sketched below.
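The summarization itself is trivial; a sketch follows, where the set of control events is taken from the list above and everything else in a track is treated as a keyword.

def summarize_track(events):
    """Keep only the keywords of a session track, in order of first appearance."""
    control = {"enter", "c", "r", "error", "leave", "re-entry", "subject"}
    seen, keywords = set(), []
    for e in events:
        if e.lower() in control:
            continue
        if e not in seen:
            seen.add(e)
            keywords.append(e)
    return keywords

print(summarize_track(["k1", "c", "C", "k2", "k3", "c", "k4", "c", "C", "leave"]))
# -> ['k1', 'k2', 'k3', 'k4']  (subject i)
print(summarize_track(["k1", "k2", "c", "c", "c", "k3", "c", "C", "c", "C", "k4", "k5", "leave"]))
# -> ['k1', 'k2', 'k3', 'k4', 'k5']  (subject j)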
All keyword and i-URL traffic is statistically analyzed from time to time. Let's see how the Thesaurus evolves. For each keyword we have at each moment two variables: its quantitative presence within the Logical Tree structure and its popularity. We may define within the Thesaurus the following groups:
a) Regular keywords
b) Synonyms of specific keywords
c) Keywords related to specific keywords
d) Antonyms of specific keywords
a versus b and their respective popularities tell us how well designed the synonymies are;
a versus c and their respective popularities tell us about some semantic irregularities;
a versus d and their respective popularities tell us about searching patterns that must be investigated in depth.
For instance, if in politics we detect a high popularity of peace and, conversely, a low popularity of war, it means that people are changing their attitude concerning the crucial problem of peace versus war. We may also investigate all the other possible combinations: b versus c, b versus d, and c versus d. A toy comparison is sketched below.
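The toy comparison assumes we simply hold a popularity counter per keyword and the group memberships per keyword; both structures are assumptions made for the sake of the example.

from statistics import mean

def compare_groups(popularity, groups):
    """popularity: keyword -> hit count; groups: label -> list of keywords.
    Returns the mean popularity of each non-empty group, ready to be contrasted pairwise."""
    return {label: mean(popularity.get(k, 0) for k in kws) for label, kws in groups.items() if kws}

pop = {"peace": 940, "war": 120, "truce": 310}
print(compare_groups(pop, {"a_regular": ["peace"], "c_related": ["truce"], "d_antonyms": ["war"]}))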
Analysis of some other types of user interactions
We may save all user logins and, from time to time, depersonalize them, defining common behaviors, common searching patterns. We are going to find all imaginable kinds of searching patterns, namely:
o Users that dive wide
and shallow systematically
o Users that dive wide
and deep systematically
o Users that dive wide
and shallow at random
o Users that dive wide
and deep at random
o Users that dive focused
and shallow systematically
o Users that dive focused
and deep systematically
o Users that dive, picking at random, either shallow or deep
All these and many other categories are divided into frequent and eventual users as well.
A user could feed back to FIRST in the following ways:
o Making comments about i-URL's (c stage)
o Making comments about specific URL's once reviewed (C stage)
o Making comments from predetermined behavior-tracking places strategically distributed along the system, for instance: before entering the query process, leaving the query, before leaving the site, during the query process
o Making open suggestions from inside the site
o Making open suggestions from outside the site
We may design the user interface to warn users when they are about to abandon the system, and to welcome them when coming back from C-type inspections. Eventually, as we commented above, the users could get a wrong address.
Path ↔ keyword string correspondences
We said that for each path of the initial logical tree of a given Major Subject of the HKM we define a string of keywords, where possible with priorities, let's say from left to right. After the three-stage procedure depicted in the FIRST white papers, we have an initial set of correspondences between paths and strings, both related to specific subjects under each Major Subject. A minimal sketch of such a correspondence table follows.
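Very schematically, the correspondences can be pictured as a mapping from tree paths to ordered keyword strings; the paths and keywords below are invented purely for illustration.

# Illustrative path -> keyword-string table (priorities read left to right).
path_to_keywords = {
    ("Personal Financing", "Financial Resources"): ["loans", "credit", "grants"],
    ("Personal Financing", "Savings"): ["savings", "interest", "deposit"],
}

def keywords_for_path(path):
    """Return the keyword string attached to a logical-tree path, empty if none yet."""
    return path_to_keywords.get(tuple(path), [])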
[Figure: evolution of the World Virtual Library of the HK and its related HKM]
After a measurable evolutionary change for a given users' market, the initial World Virtual Library of HK changes. In the figures above we depict such a change. Some documents (central figure) will be considered "useless" (light yellow regions) and some were added to the system, extracted from the HK as_it_should_be region (reddish regions). Finally, the third figure shows how the actual World Library of the HK and its related HKM will look. Topologically, for the next evolutionary step we consider the situation as at the beginning: a red circle within a larger yellow circle, but leaving a smaller yellow crown.
However, if we do not change the logical tree and the Thesaurus accordingly, the procedure will fail: the process meant to make the red region converge to cover as much as possible of the yellow region will enter a vicious circle.
6- Noosphere Mechanics
Red: the HKN model, a Human Knowledge Network sample, a cultural model constituted by a set of selected Websites.
Yellow: the HKN as_it_should_be, depicting the whole Human Culture without dominant cultural biases.
Orange: the pre-HKN model, the set of documents, articles and essays that establish the "formal" HK basement at a given time for a given culture.
Blue: the opinions, thinking movements, drafts, tests and communications that feed the orange crown.
From Light Blue to Black: the massive Noosphere, a continuum of "bodies" (Websites) hosting and broadcasting information and knowledge.
↔ Red: ~ 500,000 sites
↔ Yellow: ~ 1,000,000 sites
↑ Orange: ~ 5,000,000 documents
↑↑ Blue: ~ 100,000,000 documents
↑↑↑ Massive Noosphere: ~ 1,250,000,000 sites
(↔ approximately stable in volume; ↑ high rate of increase: the more arrows, the higher the rate.)
[Figure: worthy Websites ("red points") dispersed in the Web space]
Red Points: worthy Websites dispersed and extremely diluted along the Web space. The worth is a function of the culture and, of course, of time: for instance, a site showing how a 4-month-old baby swims could be considered unworthy today but could perhaps be a fundamental document, 200 years from now, for some disciplines of the Human Knowledge.
[Figure: a worthy site "discovered" by FIRST]
The figure above shows the "discovery" of a worthy site made by FIRST, the Expert System that manages the HKM. FIRST is continuously searching for new sites that deserve to be filed in the HKM. It is not shown here how the HKM detaches itself from references to "useless", obsolete and incomplete sites.
[Figure: a net of "red points" growing collectively]
In the figure above, a set of "red points" forms a net, augmenting its worth substantially, each node growing by mutual nurturing and all of them growing collectively as well. Some primitive examples are the "Virtual Communities" and "Web Rings".
7- An approach to Website Taxonomy
How
to browse a site to measure its structure
Parameter template
For the implementation of our first Expert System, to administer the matchmaking process of a B2B site, we searched the Web along a two-month journey, visiting more than 6,000 sites dealing with e-Commerce. Some of them were Verticals, some were Hubs trying to encompass the main industrial activities and services of a highly industrialized nation like the USA. For each of them we tried to take into consideration the following set of factors:
· Type of site
· Its Traffic
· Its Design
· Its Universality
· Its Bandwidth
(schematic)
· Its Deepness
· Its Level of
functionality (in our case Verticality)
· Uses of the site
· Types of users
To identify all these variables we first designed a Utopian Universe of Authorities endowed with everything imaginable, for instance the USA Library of Congress, http://www.loc.org, NASA, the American space agency, http://www.nasa.gov, or the WTO, the World Trade Organization, http://www.wto.org/: huge and complex sites that supposedly deal with their Major Subject in an integral way. Browsing carefully within some "clusters" of authorities at that level, just comparing "pound for pound" among them and with other minor sites dealing with the same Major Subjects, we tested the main variables of our template.
Type of site: we found many types of sites, sometimes not easy to define because many of them were a combination of several types: specialized Websites with proprietary content, Portals, Directories, Facilitators, Portals of Portals, Vortex, Vortex of Vortexes, Platforms, Yellow Pages, etc.
The Design has changed substantially of late. Up to now we have witnessed a Web evolution with designs made to attract traffic and to maintain a reasonable loyalty to the site. Every detail was exhaustively considered: speed, readability, sequencing, layout, colors, wording, flow, login, customer support, error handling, etc.
The Web user's behavior is not well known yet, but we can state some classical requirements for getting both a nice first impression and a durable membership:
- The design and its content must form an indivisible unit
- The design must be serious and professional, in that order of preference
- The design must avoid sterile and deceptive "tours" to attract traffic
- The design must prioritize the users' time and needs.
Deepness: by deepness we understand the average depth of the site tree, that is, how many clicks inward, on average, we can go while still finding valuable information.
Bandwidth: the width of the announced subject spectrum actually covered, measured from high to low, from rich to poor, or as a percentage.
These two concepts, deepness and bandwidth, proved to be extremely important to define the potentiality and quality of a site; they let us, for example, differentiate a site with a wide bandwidth and a deepness of 3 from a poor site with a wide bandwidth but almost empty, and thus orient and facilitate the task of users, taking into consideration that the average user will not be able to appreciate that kind of subtle difference from the beginning.
Verticality (Functionality) was another concept we were not yet used to appreciating easily. By that we mean how integrally the Major Subject is covered along the site as a whole. We found verticality to be inversely related to bandwidth. This concept was particularly useful to compare e-Commerce "Verticals".
Use of the site: the institutional, professional, academic, religious, communitarian or commercial use of the site: to enhance an institutional image, to work for human welfare, to fight against something, to virtually behave as a community center, etc.
Type of users: we defined three types of users looking for three types of resources, namely beginner, medium and expert, looking for information, knowledge and entertainment.
Universality refers to taking into consideration all the possible users' needs, expectations and cultural differences and all possible users' jargons, from beginners to experts, from small enterprises to big corporations, and across geographic origins.
Finally, traffic is a very important factor but very difficult to evaluate accurately.
A fast and straightforward way to compare Websites could be structured taking into consideration only four of those variables, namely:
· B – Bandwidth
· D – Deepness
· Q – Quality, as an overall judgment of value of the remaining variables
· T – Type of users
Accordingly, we could then define a qualitative-quantitative four-dimensional metric, BDQT. Deepness is decisive to judge the seriousness of a site. Most of the sample investigated, nearly 6,000 e-Commerce sites, had an average deepness factor lower than 2! For instance, a well-known B2B site alleged to have more than 50 vortexes implemented, but only 10 had a deepness of 4 and the rest offered one click and then nothingness in terms of real proprietary content! A toy BDQT scoring sketch follows.
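In the toy sketch below, the weights, the threshold and the way the variables are combined are purely our assumptions, meant only to show how such a metric could be put to work in practice.

from dataclasses import dataclass

@dataclass
class SiteMeasure:
    bandwidth: float   # B: fraction of the announced subject spectrum actually covered, 0..1
    deepness: float    # D: average useful click depth
    quality: float     # Q: overall judgment of value of the remaining variables, 0..1
    user_type: str     # T: "beginner" | "medium" | "expert"

def bdq_score(m, min_depth=2.0):
    """Penalize deepness below the threshold, then combine B and Q multiplicatively."""
    depth_factor = min(m.deepness / min_depth, 1.0)
    return m.bandwidth * m.quality * depth_factor

print(bdq_score(SiteMeasure(bandwidth=0.8, deepness=4, quality=0.7, user_type="expert")))  # ~0.56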
8-
FIRST within the vast world of AI – IR
Some contextual ideas and hints to improve
its implementation
FIRST niche
FIRST is a methodology to create a basic Knowledge index of the Web, with some auto-learning capabilities, that allows Web users to find relevant information in only one click of their mouse. That's all. FIRST is an Information Retrieval methodology that has little in common with KR methodologies, languages and algorithms to represent the Human Knowledge in a true form, and some things in common with Knowledge taxonomy. I believe that FIRST could be considered a primitive AI application that emphasizes the role of humans as experts to start its evolutionary process running. Even though the purpose of FIRST is humble, the task of starting to run as an almost autonomous Expert System capable of learning from user-Web interactions is so immense that it was conceived to be aided by two communities of IA's (generally "knowbots"), to optimize the work of the initial staff of human experts and, once running, to progressively replace the human intervention until full autonomy. The first community was conceived to speed up the completion of the initial or "mediocre" solution, and the second to make that mediocre solution evolve along time. We may then imagine operating, within the always exponentially expanding Web, a network of these cells of HK, perhaps clones of one initial mediocre solution, but evolving differently depending on the users' communities (human beings) and on the general policies that control the behavior of the HK administrators, be they humans or IA's.
FIRST is completely defined by itself in a set of 6 "white papers". In this section our aim is to present FIRST to the scientific community. The Human Knowledge could be depicted as an infinite semantic (and why not emotive?) network with a complexity not yet known. Much relevant work has been done in that direction, with contributions ranging from metaphysics and philosophy to mathematics and logic, as we will review below, but many contributions are common-sense findings. FIRST falls within this last category of thinking, like many IR tools of the past, such as KWIC, and like many search engine approaches of the present, such as Yahoo, Altavista and Google.
We are going then to navigate the Web, making stops at some "authorities" concerning our aim. Our first stop will be the ResearchIndex of NECI, the Scientific Literature Digital Library (the site was moving to a new place, check it!). As sound proof of the expansion and mobility of the Web, when I was rehearsing the sites and documents of this section the NECI "old" site moved and, not only that, most of the references I was consulting disappeared, replaced by new foci of interest such as learning, vision and intelligence.
One of the "rules of thumb" advised by our experience is the cardinal number that determines the taxonomic size of our HKM, fixed at 250 Major Subjects. As explained in our white papers, 250 is a common upper limit in important Hubs. But let's take a look at the index of the AI taxonomy as depicted in this site. Concerning our own research about Human Knowledge, it is another example of knowledge itemization: the whole literature dealing with this Major Subject (Digital Library), which behaves as a Hub for almost every other human MS, encompasses 17 subjects and nearly 100 sub-subjects, Knowledge Representation under the Artificial Intelligence subject being one of them.
KR: Knowledge Representation talks about the concept of hubs and authorities, two polar kinds of "nodes" within our HK subspace: nodes that act as hubs point towards the basic and most popular authorities.
We are going to consider the following essays:
· SENECA, Semantic Networks For Conceptual Analysis, from Ernesto García Camarero (egc [at] swvirtual [dot] es), J. García Sanz and M.F. Verdejo of the Centro de Cálculo, Universidad Complutense de Madrid and Universidad Politécnica de Madrid, Spain, 1980.
· A Survey On Web Information Retrieval Technologies (the document is provided in pdf format), from Lan Huang, Computer Science Department, State University of New York at Stony Brook, email: lanhuang [at] cs [dot] sunysb [dot] edu, May 1999.
· Software Agents: An Overview, from Hyacinth S. Nwana, Intelligent Systems Research, Advanced Applications & Technology Department, BT Laboratories, Martlesham Heath, Ipswich, Suffolk, IP5 7RE, U.K., e-mail: hyacinth [at] info [dot] bt [dot] co [dot] uk, 1996.
Specifically, Camarero applies his methodology to a piece of archeology, an object extremely bound to the past, an amphora, while Baral applies Prolog (a promising LP language rooted in the work of Robert A. Kowalski; see "The Early Years of Logic Programming", CACM, January 1988, pages 38-43) to depict two classical problems, each resembling pretty well a piece of human logic: the Flying Birds and the Yale Shooting Problem. Both essays feed our hope of HK classification made by robots without human intervention, or at least with negligible human intervention, in the near future. However, the problem of the exponential expansion of the volume of the Web still remains.
DAI and KK: meanwhile, the HK in the Web space is a gigantic and living entity continuously expanding within the "noosphere"; what is really important for our practical purposes is its "kernel" (FIRST points to the kernel of basic knowledge needs of average Web users). Concerning the intelligent outcomes of actual human beings, only successful pieces of data and intelligence remain in the kernel. Of course, to accomplish that evolutionary task of filtering and fusion we need something like a short-range memory registering every potentially valuable human intelligence outcome. What we do need are procedures to optimize the process of selecting the data and intelligence to be added to the kernel or, in some instances, to replace parts of it, because the HK follows a model of "non-monotonic logic" within an "open world context".
Key behavior of some IA's: we have seen in FIRST how easy it is to build manually, using human experts, the first approximation to such a kernel. The problem is then how to improve this kernel henceforth by means of intelligent agents. Our first intuitive approach was to trust the mismatch process. When a user queries the kernel there are two possible outcomes: found, not found.
When found, the users are attracted by the massive power of the kernel, like a gravitational force, and we may suppose that they found what they were looking for. When not found, we are in the presence of a real fight_for_living scenario, in terms of intelligent behavior. When users cannot find something they try to do their best to "win", either beating the kernel or finding an open back door to enter. What is really important is the track of those fights. Let's imagine millions of users trying to access myriads of such kernels from all over the world, in different languages and different jargons, and belonging to different types of marketing (in a broad sense) behavior.
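A minimal sketch of how those mismatches could be recorded, assuming the kernel is queried as a plain mapping subject -> set of keywords (our simplification, not the actual kernel structure):

from collections import Counter

found_hits = Counter()   # queries the kernel could answer
not_found = Counter()    # the "fight_for_living" queries: the valuable mismatches

def record_query(keyword, subject, kernel):
    """Log the outcome of one query; mismatches are what drives the kernel's evolution."""
    if keyword in kernel.get(subject, set()):
        found_hits[(subject, keyword)] += 1
        return True
    not_found[(subject, keyword)] += 1
    return False

def candidates_for_review(threshold=10):
    """Missing keywords queried often enough deserve the attention of editors and agents."""
    return [pair for pair, n in not_found.items() if n >= threshold]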
DAI, network: we need local agents to record those tracks, agents to typify them, "knowbots" (specific intelligent agents that deal with tiny intelligent pieces) to suggest courses of action, and we need "mediatorbots", negotiator agents, to solve conflicts, for instance those due to errors. Once things are managed locally we are going to have kernel clones by the thousands, and we are going to need "coopbots", cooperative agents, to join efforts made in different Websites of a knowledge network and to behave socially. Very probably, on users' behalf, we are going to need "reactbots", reactive agents, as well, that is, agents open to known and unknown stimuli, trying to react (mostly in friendly manners) to user actions, for instance detecting wandering and disorientation. And finally, we are going to need "learnbots", learning agents, and even "smartbots", smart agents, to substitute for human intervention in a process initially controlled by humans.
WEB scenario (based on July 1998 data): at that time we accounted for laudable Web Knowledge Classification projects like Taper from IBM and Grid/OPD. Concerning growth, a projected rate of 6% monthly means doubling each year. Now we have nearly 1,300 million documents! However, besides growing, the Web scenario presents significant differences concerning IR approaches, namely:
· IR was designed for static text databases, not for an always growing and changing environment;
· Extremely high heterogeneity: more than 100 languages and thousands of specific jargons;
· Hypertext nature with high linkage: more than 8 links per page;
· Complex and unexpected queries with a wide spectrum of users' errors;
· Wide spectrum of users of unknown behavior;
· Specific impatient and even neurotic behavior: users only see the first screen!
· IR performance is measured along three variables: recall, global precision and, specifically, top-10 precision.
Top search engines: Google uses an innovative algorithm, the PageRank Algorithm, created by L. Page and S. Brin, for page ranking; Altavista has one of the most complete databases and, with its Advanced Search facility, equals Google's feature; Infoseek offers an interesting search-among-results feature. Other directories analyzed were Yahoo, Infomine, Britannica, Galaxy and Librarians' Index.
HKM complementary information: in our white papers we talk about the second priority for common Web users, the noosphere shell of Technical and Scientific Information. This shell could be implemented via the Northern Light search engine services, which provide queries over thousands of Journals, Reviews and Proceedings!
Lexicons: Google, for instance, had at that time nearly 14 million words! FIRST will work with an initial Thesaurus of 500,000 keywords. Most of the words in actual lexicons are references: names, titles, toponyms, brands. FIRST should prevent this, pointing users to the classical search engines.
Operational Hints: as any search engine has three parts, a Crawler, an Index System and a Database, we must learn as much as possible from these heavy-duty components in order to implement FIRST. For instance, to optimize the FIRST IR task we must take care of working with portions of the DNS in order to look for one server at a time, caching query results to browse sites economically. Another problem will arise from HKM updates: it is highly recommended to do them incrementally instead of totally, as is normally done in today's search engines; the FIRST architecture considers that feature. Even being a "one click" engine, FIRST answers to queries could be weighted via a relevance-popularity algorithm, something like the PageRank Algorithm, namely:
Given a page (a) and a set T = {T1, ..., Tn} of pages linking to it, we may define PR, the PageRank of (a), as
PR(a) = (1 - d) + d [PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn)]
where d is a damping factor and C(Ti) is the number of links going out from page Ti. Google sets d to 0.85, with 0 < d < 1.
We are talking here about "popularity" in terms of Website "owners", not in terms of Website "users": PR(a) is then the probability that a random surfer visits that page, the damping factor modeling the chance that the surfer gets bored and requests some other random page.
Warning: we may design a d factor for our first HKM and calculate all PR's. FIRST will then compute d and PR as users' factors instead, by far more realistic! A small iterative sketch follows.
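The sketch below simply iterates the formula above until it settles; it is not Google's implementation, just the recurrence, where links maps each page to the pages it links to.

def page_rank(links, d=0.85, iterations=50):
    """Iterate PR(a) = (1 - d) + d * sum(PR(Ti)/C(Ti)) over all pages."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    pr = {p: 1.0 for p in pages}
    out_degree = {p: len(links.get(p, [])) or 1 for p in pages}
    for _ in range(iterations):
        new = {p: 1 - d for p in pages}
        for p, targets in links.items():
            share = d * pr[p] / out_degree[p]
            for t in targets:
                new[t] += share
        pr = new
    return pr

print(page_rank({"b": ["a"], "c": ["a", "b"], "a": []}))  # toy graph: "a" ends up on top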
HITS (published in French), which stands for Hypertext Induced Topic Search, created by Jon Kleinberg while working with IBM, to identify sources of authority: we find it very useful to implement the concepts of hubs (good sources of links) and authorities (good sources of content). I think most of our URL's will correspond to authorities. A good hub is one that points to many authorities and, conversely, a good authority is one that is pointed to by many hubs.
As in the work of Kleinberg, we are going to work on a small but extremely select subset S(K), related to HK basic documents, with the following properties:
· S(K) is relatively small
· S(K) is rich in relevant pages
· S(K) contains most (or many) of the significant authorities
Kleinberg states that "by keeping it small, one is able to afford the computational cost of applying non trivial algorithms"; "by insuring it is rich in relevant pages it is made easier to find good authorities, as these are likely to be heavily reinforced within S(k)."
Kleinberg then suggests the following way to find such a collection of pages. For a parameter t, typically set to about 200, the HITS algorithm collects the t highest-ranked pages for the query K (for instance, for a Major Subject of the HKM) from an engine such as Google or Altavista. These t pages are referred to as the root set R(K). HITS then increases the number of strong authorities in the subgraph by expanding R(K) along the links that enter and leave it. HITS moreover works on transverse links, that is, links that go out to external domain names.
Warning: it will be interesting to compute weights for both hubs and authorities; a minimal iteration sketch follows.
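The sketch below shows the hub/authority reinforcement on a small subgraph such as S(K); the graph is given, as an assumption for the example, as a mapping page -> outgoing links.

def hits(graph, iterations=20):
    """Mutually reinforcing scores: good hubs point to good authorities and vice versa."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, [])) for p in pages}
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth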
OGS: OGS, Open Global Ranking Search Engine and Directory, is a distributed concept trying to use the opinions of all search facilities. They propose to add some extra tags to the HTML standard (a similar approach to the one we used in our i-URL's). Warning: they still trust Website owners' honest behavior. For example they suggest:
<a href=............ cat="/news/computers" rank="80%">
stating that the author considers that document a serious and valuable one (80 out of 100!). OGS is still an open proposal, to be collectively managed by us as users. The proposal is naive but intelligent and within the Internet utopia of fairness, openness, democracy and freedom.
Technically, what is proposed is a small change to the HTML standard that lets people easily state their opinions about the information on the sites they create links to. These opinions include the category to which the site belongs and the rank of the site in that category according to the person making the link. The opinions are then weighted according to the author's reputation in a field, which in turn is also determined by such weighted opinions of all the people that have expressed them. We are going to study these concepts carefully when implementing FIRST.
TAPER: TAPER, which stands for Taxonomy And Path Enhanced Retrieval system, was developed by Soumen Chakrabarti, in collaboration with Byron Dom and Piotr Indyk, at the IBM Santa Teresa Research Lab, 1997. You may also find the related document (in pdf) Using Taxonomy, Discriminants and Signatures for Navigating in Text Databases, written by Soumen Chakrabarti, Byron Dom, Rakesh Agrawal and Prabhakar Raghavan of the IBM Almaden Research Center, 1997.
Basically, TAPER is a hierarchical topic analyzer that achieves high speed and accuracy by means of two techniques: at each node in the topic directory, TAPER identifies a few words that, statistically, are the best indicators of the subject of a document; it then 'tunes in' to only those words in new documents and ignores 'noise' words. The second technique guesses the topic of a page based not only on its content but also on the contents of pages in its hyperlink neighborhood. We use a similar approach in the first step, when building the mediocre solution of FIRST. You may find a nice document dealing broadly with what we are discussing here in the doctoral thesis focused on how we find information on the Web by yangk [at] ils [dot] unc [dot] edu, March 29, 2001, School of Information and Library Sciences, University of North Carolina at Chapel Hill.
Web Sizing: the idea suggested in Krishna Bharat and Andrei Broder's paper (you may access their papers at the DBLP Bibliography) is straightforward and largely common sense: to sample search engine universes statistically. Related to this issue of measuring Cyberspace, such as Web mining and Web metrics, you may find some other works by Bharat and Broder in Cybermetrics.
They designed a sampling procedure to pick pages uniformly at random from the index systems of the major search engines. The conclusions were that 80% of the total universe (200 million documents occupying 300 GB) is indexed at any moment. The biggest engine at that moment, Altavista, registered 50% of that universe, and the intersection of all engines proved to be extremely poor: 4%! Concerning some deep investigations about how the Web evolves, we recommend Web Archeology, by the Research Group of Compaq, where one of the archeologists is Andrei Broder. One of the outstanding collateral findings of these investigations was that almost one third of the documents hosted in the Web are copies and that, of nearly 1 million words in a full English dictionary, only 20,000 are normally used by navigators. Concerning that, I am convinced that most Web users make their queries with an extremely poor vocabulary of no more than 3,000 words. By the way, this is very easy to investigate, and I did my own personal estimation in Spanish with a sample of 250 students of the Systems Engineering career at an Argentine university and found they used nearly that.
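The relative-size part of that sampling idea can be sketched in a few lines; this is our own simplification of the approach, assuming we have already measured which fraction of a random sample of engine A is found in engine B and vice versa.

def relative_size(frac_a_in_b, frac_b_in_a):
    """If |A ∩ B| ≈ frac_a_in_b * |A| ≈ frac_b_in_a * |B|, then |A|/|B| ≈ frac_b_in_a / frac_a_in_b."""
    if frac_a_in_b == 0:
        raise ValueError("no overlap observed; cannot estimate")
    return frac_b_in_a / frac_a_in_b

print(relative_size(0.35, 0.70))  # A would be about twice the size of B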
KR - one outstanding IR "authority"
We were reviewing the book Knowledge Representation by sowa [at] bestweb [dot] net, August 1999, commented at BestWeb, which has everything we are talking about. We comment on some parts here to appreciate globally where we are. Sowa distinguishes three necessary components of any IR study, listed in the subtitle of his work: Logical (logic), Philosophical (ontology) and Computational foundations. We are going to use it as a mathematical background to implement FIRST programming, together with the tutorial prepared by Sowa to that effect.
The level of our white papers is a first global step, easily understood by everybody with a minimum IT/Internet background. The second level must be the general directives expressed using the glossary implicit in this section. The third level must be expressed algorithmically, either using mathematical IR notation as used in the mentioned tutorial or some Logical Programming language, and the fourth level is just entering the software realm.
New and Old ideas in action now
Clustering
Clustering is a relatively "old" technology: once we get an answer to a query it can be organized into meaningful clusters, so if it works and does not take significant processing time it always adds and never subtracts with respect to a better understanding (there is no ranking). You may see it in action in the new search engine Vivísimo, originated at Carnegie Mellon University and launched in February 2001. It works fine on scientific literature, Web pages, patent abstracts, newswires, meeting transcripts and television transcripts. It is a meta search engine because it works over several search engines at a time, applying the clustering process to all the answers. It is especially apt when users do not know how to make accurate queries; they advise using regular search engines in those cases. As they work directly on the pipeline of answers, the procedure is called "just in time clustering".
Clustering is to some extent along our idea of working more on keywords than on categories or any other type of classification. The heuristic algorithm works freely over the answers, without any preconception. The process is rather fast: clustering 200 answers of nearly three lines each takes 100 ms on a Pentium III 1 GHz. Let's try what happens with "clustering". It gives 194 results (documents) and a set of branches of the tree whose root is clustering. It shows only part of the branches, which helps us select clusters. For instance, if we select data, it delivers 22 documents along the path clustering > data, where we find some documents related to clustering definitions and data mining. One of the documents in this cluster explains what a cluster is, namely:
In
general, a cluster is defined as a set of similar objects (p.9 Hartigan). This
"similarity" in a given set may vary according to data, because
clustering is used in various fields such as numerical taxonomy, morphometrics,
systematics, etc. Thus, a clustering algorithm that fits the numerical measure
of optimization in a data may not optimize another set of data (for example,
depending on the units selected). There are many algorithms to solve a
clustering problem. The algorithms used in our applet concentrate on
"joining", "splitting", and "switching" search
methods (also called bottom up, top down, and interchange, respectively). They
are shown by their representative methods: minimum-cost spanning tree
algorithm, maximum-cut, and k-means algorithm.
Good enough to start knowing something concrete about clustering. To search for methods we use the branch clustering > methods, where we find only 5 documents, but all valuable. We are going to study carefully how to implement clustering in FIRST, mostly because we believe that the general user does not have an accurate idea of what he or she is looking for. Usually users have some keywords more or less related to their needs and sometimes they have an idea about the name of the subject. A crude clustering sketch over answer snippets is given below.
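The sketch below is only a crude stand-in for clustering applied to the answer pipeline: it groups result snippets by their rarest shared content word. Real engines use far smarter heuristics; the stop-word list and the grouping rule are our assumptions.

from collections import defaultdict
import re

STOP = {"the", "a", "of", "in", "and", "to", "is", "for", "on"}

def quick_clusters(snippets):
    """Group answer snippets by the most specific word they share with other answers."""
    df = defaultdict(int)       # document frequency of each content word
    tokenized = []
    for s in snippets:
        words = {w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP}
        tokenized.append(words)
        for w in words:
            df[w] += 1
    clusters = defaultdict(list)
    for s, words in zip(snippets, tokenized):
        shared = [w for w in words if df[w] > 1]   # words that can actually group documents
        label = min(shared, key=lambda w: df[w]) if shared else "misc"
        clusters[label].append(s)
    return dict(clusters)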
We may see a recent paper about Clustering and Identifying Temporal Trends in Document Databases (or download it from our site), from Alexandrin Popescul, Gary William Flake, Steve Lawrence, Lyle H. Ungar and C. Lee Giles, IEEE Advances in Digital Libraries, ADL 2000, Washington, DC, May 22-24, pp. 173-182, 2000. To check results they used the CiteSeer database available at http://csindex.com, which consists of 250,000 articles on Computer Science, of which they used 150,000. Their algorithm works on the ideas of co-citation and previously determined influential papers.
Teoma
Teoma is a project of the Computer Labs at Rutgers University, launched in May 2001, trying to surpass Google. Teoma calculates relevance using link analysis to identify "communities" on the Web, and then determines which sites are the authorities within those communities to find the best pages. Whereas Google uses the collective wisdom of the entire Web to determine relevance, Teoma tries to identify "local" authorities to help identify the best pages for a particular topic.
Collection of Glossaries is another laudable effort, made by Aussie Slang (that is not a woman, it stands for Australian Slang!), to facilitate users' navigation. It is only a directory of glossaries and dictionaries, but it could be useful for the initial tasks of building the HKM, when gathering trees, paths and keywords of Major Subjects of the HK (they say they have catalogued more than 3,200 glossaries, really an upper bound for the volume of our Thesaurus).
We comment here on some parts of Towards Knowledge Representation: The State of Data and Processing Modeling Standards, from tony [at] ontek [dot] com of Ontek Corporation, 1996. It is another source to fully depict the state of the art related to KR. In the Web domain we are dealing with conventional knowledge, at least forms of the classical written knowledge complemented with some images. There are some other forms of knowledge, related to social structures like social groups, organizations and enterprises, that are not easy to represent. For instance, we may write a resume of the Library of Congress site, trying to describe it with words, but it will be extremely difficult to provide a map of its built-in knowledge as an institution. Perhaps in the near future those maps could form part of organizations as a by-default image. This paper deals with languages and models to depict organizations so that they can be universally understood. It is something in a similar line to XML and XQL, but referred to organizations: standards under the control of ISO, the International Organization for Standardization. We have to allocate room in the i-URL's of FIRST to take that near-future possibility into consideration.
In the same way that we are talking about HKM networks, we have to anticipate heterogeneity, that is, other forms of HKM and of course different forms of KR. So, to preserve the future of FIRST, we must try to keep it compatible with all imaginable forms of maps and representations. In that sense we must take into account even compatibility with actual NPL's, Natural Programming Languages. The implementation of FIRST will itself need an organization, and that will be the opportunity to use well-proven methodologies, like for instance CASE, to model it and, redundantly, to serve as a model.
As there are many development lines along knowledge, we have to locate FIRST on the right track from the beginning. Knowledge should be represented, giving rise to KR tools and methodologies; knowledge must be organized in order to access it, giving rise to libraries; and finally it must be administered for human welfare, giving rise to knowledge management. Knowledge is the result of the socialization of humans and, once generated, is shared by a community. Shannon found the equivalence between energy and information, and now we need to go a step further to find the equivalent for knowledge. Intuitively, the Internet pioneers always talked about three main resources: information, knowledge and entertainment. In the Web we may find specific sites for each resource, and there are many, like Portals, that have all three. Before the Internet, what we call the "noosphere" did not exist: knowledge was stored in libraries, in books and in our minds, and only pieces of information and knowledge were matters of communication, an agreed traffic of messages among people. The Internet, and specifically the Web, brings a universally open and free noosphere where people are absolutely free to obtain what they need in terms of information, knowledge and entertainment from their e-sources. For the first time the knowledge is up there, up to you, when you need it, anywhere. The only problem to be solved, even being free, open and universal, is how to find it and how to understand it or, more technically, how to retrieve it and how to decode it. To decode it properly, all documents in the noosphere must be standardized and, for this reason, the programs, tutorials and data of FIRST must be implemented via standards, for instance DSSSL for data.
To know precisely the state of the art of the realm where FIRST is going to operate, we must take a look at the efforts made concerning Digital Libraries. In our white papers we mentioned that the amount of "basic documents" that represents the HK for a given culture at a given time tends to be rather constant, or moves upwards or downwards at a very small pace, sometimes fluctuating around rather constant averages. For instance, we cited the famous Alexandria Library, which when destroyed stored about 300,000 basic documents. We are now talking about 500,000 reference e-books: not too much more.
The Alexandria Digital Library Project, in Santa Barbara, California, USA, is focused on Earth data. The central idea is that, once finally implemented, it will be the origin of a world-distributed network of mirror clones, like the network we imagine for our project. For us this library will be an authority of the type earth sciences => geo systems => image libraries. Normally, Website authorities lead to related authorities, and in this case the rule holds:
Carnegie
Mellon University
Stanford University
University
of California at Berkeley
University
of Illinois at Urbana-Champaign
University of Michigan
This informs us that the National Science Foundation (NSF), the Department of Defense's Advanced Research Projects Agency (ARPA) and the National Aeronautics and Space Administration (NASA), three of the Internet's "big" pioneers, sponsor all those libraries and are part of the leading project, the Digital Libraries Initiative, which adds the cooperation of the extremely worthy National Library of Medicine, the Library of Congress, the NEH, National Endowment for the Humanities, and recently, last but not least, the FBI, Federal Bureau of Investigation.
With such a demonstration of power we wonder about the future of the Internet's freedom utopia. No doubt all of them are super-authorities, but be careful: the Big Labs of medical drugs are "too well intentioned" and extremely powerful, concentrating too much I_am_the_truth power. We were talking about 500,000 basic documents out of an expanding Web universe that now holds some 1,300 million documents. Will the HK be concentrated in quite a few reference houses or dispersed among 500,000? Or will it perhaps show us a harmonious combination of dependence versus independence in matters of knowledge?