Introduction
One of the most common comments about the SMBmeta Initiative is something to the effect of: "We've had meta-tags before
and spammers abused them so they are no good." (For example, see the
News.com article about SMBmeta.) The purpose of this essay is to explore the spam issue and see how it relates to the specifics of SMBmeta.
Search engines use various algorithms to determine relevance of a particular page to a search. Those algorithms are supposed
to fairly rank results based on assumptions about what "normal" good web sites would do, with an eye towards looking out for
cheaters. Let's discuss what "spam" is in the context of search results.
There are two types of search engine spam, as I understand it. The first type is when a web site tries to show up in
searches for which it is not appropriate, in order to increase random traffic to the site. This is much like
shotgun email spam, whose senders hope that somebody among the millions of recipients will find it of interest. A very problematic example
is pornography web sites that attempt to get traffic from unsuspecting people browsing for something else (e.g., the pornographic
whitehouse.com vs. the official whitehouse.gov). I'll call this "Shotgun Spam".
The second type of spam is when a web site tries to get preferred position in the search results. This ranges from making
sure that a web site is not treated worse than its peers due to some oversight on the author's part (such as not using words
related to its subject on a home page or in its title), to trying to get extra benefit by packing in key words in text only
seen by search engines and creating networks of linking sites to appear popular. I'll call overzealous use of such techniques
"Search Engine Optimization (SEO) Spam". Making sure that a web site does what the search engine expects and avoids the appearance
of "bad" behavior that would inappropriately penalize it is a good thing, so not all SEO is bad, just SEO Spam that knowingly
and purposely games the system.
Why is spam bad? It is bad because if it succeeds in inappropriately biasing results, searchers learn that they can't
trust those results and will depend upon the Internet less. The success of Google and others in finding relevant pages in
response to a query has been cited as being a major reason for the success of the Internet. According to the Pew Internet
Project's
Counting on the Internet report, such web sites "...help deliver on expectations and...have become, for many Internet users, trusted online sources
and tools." The search engines value this trust highly and will gravitate towards ranking techniques that help meet, and raise,
those expectations. They abandon techniques that result in frequent obvious "surprises" that lower the trust in their ability
to rank results.
The old Keywords meta tag
For a few years, search engines were programmed to consider the contents of an HTML tag in the form of <meta name="keywords"
content="word1, word2, etc.">. Web site authors could put this tag in the heading portion of a web page's source, along
with other tags such as <title>title text</title>. The idea was that these keywords were hints to the search engine
indicating topics for which the page was relevant. Since early search engines merely indexed words, this was a manual means
to pull out the "key" words and associate terms of art ("Chinese food") with the specific words on the page ("moo shi").
Simplistic algorithms, such as frequency of use of a word, were used, so having a Keywords meta tag with the same word
repeated could "bump" your position. The keywords were for all intents and purposes only visible to the search engines. When
you saw a search result, you were not told that a page was shown because of a particular keyword. All this led to the ability
to do both Shotgun Spam and SEO Spam without the searcher knowing it was happening. Search engines found the spam so prevalent,
and their other techniques for searching so able to make up for the loss of benefit from those keywords, that they pretty
much dropped looking at Keywords meta tags. (See
Revisiting Meta Tags in SearchEngineWatch.) A few apparently still use them when there are no other ways of finding a requested search term
using their normal techniques.
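To make the repetition problem concrete, here is a minimal sketch in Python of the kind of simplistic frequency scoring described above. The function name and the sample keyword strings are made up for illustration; no real search engine's algorithm is being reproduced here.

```python
from collections import Counter


def naive_keyword_score(keywords_content: str, query: str) -> int:
    """Count how many times the query term appears in a Keywords meta tag value.

    This is a deliberately simplistic, hypothetical scoring rule: every
    repetition of a term raises the score, which is exactly what made
    the tag easy to abuse.
    """
    terms = [t.strip().lower() for t in keywords_content.split(",") if t.strip()]
    return Counter(terms)[query.lower()]


# Hypothetical content values of two <meta name="keywords"> tags.
honest = "chinese food, moo shi, takeout"
stuffed = "chinese food, chinese food, chinese food, chinese food, moo shi"

print(naive_keyword_score(honest, "chinese food"))   # 1
print(naive_keyword_score(stuffed, "chinese food"))  # 4
```

The stuffed tag scores four times higher than the honest one even though the searcher never sees either tag.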
Other header meta tags, such as the <meta name="description" content="text"> tag, and the <title> tag,
are used by many of the search engines, including Google. The difference here seems to be that those are shown to the searcher
in the response and, in the case of the <title> tag, are normally visible to a person viewing the page in a browser. Gaming this
system is more readily visible to searchers, so they can choose to ignore such results.
Multiple pages and third party validation
Another problem with the Keywords meta tags was that they were on a per-page basis, and only written by the author. A
web site author could inexpensively create hundreds of pages, each with its own set of meta tags, to appeal to multiple audiences.
External validation was hard, because if you found a problem with one page, it didn't necessarily help you with invalidating
what you found on another.
With its "PageRank" technology, Google used the links from other web sites to the page being indexed as a way of determining
its value and topic. This added third-party validation of some sort, though authors could initially game that system, too,
through complex webs of interrelated web sites.
The goals of traditional search
With the vastness of the material on the Web, the goal of most searching seems to have been to find answers to questions
by finding the most relevant web pages. The goal has not been to find all possible answers. The types of questions searchers
ask are "Tell me about this disease I just found out I have", or "Tell me about this actor I just saw in a movie". Yahoo!
even restricts the businesses in its directory by making it hard to get in and charging you to speed up the process.
This is different from many directories we have come to depend upon in normal life, such as telephone directories ("white"
and "yellow" pages). Until the advent of cell phones and other telecom competition, one could expect to find all people with
telephones, and especially all businesses, listed. Only those that explicitly did not want to be listed were skipped, and
they paid for avoiding that default of being listed.
When a searcher is trying to find a particular business, or find all of the possible businesses of a particular type
from which to choose, the "no false positives" attitude of search engines can be a problem. If you limit your search narrowly
enough, a "no false negatives" approach, with enough information shown to let the human searcher make knowledgeable decisions
using those results, may be more of what people want. I may want to see all carpenters, not just "good" ones. I may be searching
for "that one in Needham I used last year but don't remember."
What do we learn?
Some of the things we learn from the Keywords meta tag experience are the following:
- Showing the searcher data used in doing the search can help the searcher determine if a result is relevant and ignore
inappropriate ones
- Find ways to have third party validation
- Make it easy to flag abusers
- Do not be biased by simple repetition
- Some detailed queries can only be answered using data provided by the web site author
- Authors will go to the trouble of providing meta data if they think it will help searchers
- If it's easy to abuse, it will be abused
How might SMBmeta data be used?
SMBmeta files have a few different types of data. There are attributes, like "language" and "type", that are chosen from
very specific lists of possibilities. Other data, such as the body text of the <description> element, is free-form.
Finally, there are some Internet-specific values, such as domain names and URLs.
In the initial design of SMBmeta, it was assumed that most of the data in the file would be displayed to the searcher
as part of the results (though not all at once necessarily, depending upon the type of search). This is to help you choose
the web site to look at for further information. Since the fixed attribute values can be confining, the text provided could
be shown to provide more complete description, but the values themselves were assumed to be the thing mainly used for the
selection. If the fixed values were not sufficient for a particular type of search, then the free-form text could be used,
too, just as free-form text on a web page is used. It would be clear to the searcher reading the results which data was authored
by the business and not vouched for by the search engine.
What this means is that the SMBmeta information would be used the same way as other web site author-provided data, like
the title, the description, and the text surrounding the searched-for words. It would be shown to the searcher in a manner
that made the origin clear. Unlike the Keywords meta data, it would not be hidden. In fact, one use of SMBmeta information
could be to indicate the business name associated with a web site for normal search listings of all types, especially non-home
pages. Instead of just seeing www.xyz.com/foo/bar.html you could also see a name and perhaps a geographical indication. Web
domains for which this would be inappropriate would not have SMBmeta files.
Since inappropriate repetition could be used to bias results, such as claiming to be in all 700+ business type categories,
it was assumed that search results would either indicate such behavior (by listing all the categories claimed or at least
the number if there were more than a few) or explicitly block any domains that were clearly using too many. The specification
explicitly states that descriptions may be truncated in search result listings, so they should be short. Any attempt to provide
too much could just be ignored.
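As a rough sketch of how a search engine might apply these rules when building a result listing, consider the following Python fragment. The three limits and the function name are assumptions chosen for illustration; the specification does not define them.

```python
# All three limits are assumed values for illustration only.
SHOW_ALL_LIMIT = 3       # list categories by name up to this many
BLOCK_LIMIT = 50         # refuse to list domains claiming more than this
DESCRIPTION_LIMIT = 80   # truncate descriptions longer than this


def render_listing(domain, categories, description):
    """Return one result line, or None if the domain claims too many categories."""
    unique = sorted(set(categories))
    if len(unique) > BLOCK_LIMIT:
        return None  # clearly using too many categories: block the domain
    if len(unique) <= SHOW_ALL_LIMIT:
        shown = ", ".join(unique)
    else:
        # Too many to list by name, so at least show the searcher the count.
        shown = f"{len(unique)} categories claimed"
    if len(description) > DESCRIPTION_LIMIT:
        description = description[:DESCRIPTION_LIMIT].rstrip() + "..."
    return f"{domain} [{shown}]: {description}"


print(render_listing("yourname.com", ["carpenter"], "Custom cabinets in Needham."))
print(render_listing("yourname.com", [f"type{i}" for i in range(10)], "We do everything."))
print(render_listing("yourname.com", [f"type{i}" for i in range(700)], "We do everything."))
```

The first call lists the single category by name, the second shows only the count of ten, and the third returns None because the domain would be blocked.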
What is different about SMBmeta data from normal web page data?
One of the special requirements of SMBmeta files is that there is only one per second-level domain (e.g., "yourname.com").
That domain name can then be used as a simple key to many databases for checking uniqueness, authenticity, etc. This is unlike
most data used by search engines that are concerned with individual web pages. SMBmeta is concerned with single small businesses
and their entire web site. With only 20 million or so businesses in a country like the USA, validation databases can easily
be searched, and probably fit in RAM. The data is designed to be relatively static and small, so such databases can be created
slowly and incrementally, and shared with others.
One problem with web pages that trips up some search engines is that the data they use and the data customers see can
be different. This technique of serving up different data to the search engine's spiders than to people with browsers is called
"cloaking". Unfortunately for spammers, some of the main uses of SMBmeta data can be done with just the data found when spidering.
The uncloaked data may never be seen. If different data is shown at different times, the database nature of storage, together
with the single domain name "key", would cause new data to overwrite the old, leaving only a single set for searching.
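The effect of using the domain name as the single key can be sketched in a few lines of Python. The class name and the record fields below are hypothetical; the point is only that a second spidering of the same domain replaces the first record rather than adding another.

```python
class DomainIndex:
    """Hypothetical in-memory store holding one SMBmeta record per domain."""

    def __init__(self):
        self._records = {}

    @staticmethod
    def _key(domain):
        # Normalize so "YourName.com." and "yourname.com" share one key.
        return domain.lower().rstrip(".")

    def store(self, domain, record):
        # Storing again under the same domain overwrites the earlier record.
        self._records[self._key(domain)] = dict(record)

    def lookup(self, domain):
        return self._records.get(self._key(domain))

    def __len__(self):
        return len(self._records)


index = DomainIndex()
index.store("yourname.com", {"type": "restaurant"})
index.store("YourName.com", {"type": "carpenter"})  # a later spidering of the same domain

print(len(index))                    # 1
print(index.lookup("yourname.com"))  # {'type': 'carpenter'}
```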
Affirmation
A common question about data on the Internet, and SMBmeta data in particular, is how do you know it's true? Is there
some sort of authority affirming something about the file?
The design of SMBmeta, with the domain name "key", is built around making it easy to run such "authorities". Any number
of databases can list any or all SMBmeta files they know about and indicate whatever they want about them. It does not require
the cooperation of the business web site owner. What it does require, though, is the cooperation of the search engine operator
to check an authority.
What types of things would one want to be affirmed by a third party? Some obvious examples are:
- The smbmeta.xml file exists and is in valid format
- The web site associated with the file corresponds to the information claimed in the file (e.g., no Shotgun Spam)
- The business named in the file and on the web site corresponds to the claims in both (e.g., no Search Engine Optimization
Spam)
To aid this important function, a new element is being added to the original SMBmeta proposal: the <affirmation href="URI"
signature="value"> element. This lets the SMBmeta author give hints to a search engine about where it might find the domain
listed. If the search engine considers that "authority" authoritative, then it could check on the domain, or use the optional
signature for authenticity. The element is optional and there can be more than one. Affirmation Authorities will be covered
in another essay.
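As an illustration only, here is how a search engine might read those hints in Python. Only the affirmation element with its href and signature attributes comes from the proposal as described above; the enclosing smbmeta root element, the sample URLs, and the set of trusted authorities are assumptions, and no signature verification is attempted.

```python
import xml.etree.ElementTree as ET

# Hypothetical file fragment: the root element name and URLs are assumed.
SAMPLE = """<smbmeta>
  <affirmation href="http://authority-one.example/lookup" signature="abc123"/>
  <affirmation href="http://authority-two.example/lookup"/>
</smbmeta>"""

# Authorities this particular (hypothetical) search engine chooses to trust.
TRUSTED = {"http://authority-one.example/lookup"}


def affirmation_hints(xml_text):
    """Return (href, signature) pairs; signature is None when omitted."""
    root = ET.fromstring(xml_text)
    return [(el.get("href"), el.get("signature")) for el in root.iter("affirmation")]


def trusted_hints(xml_text, trusted):
    """Keep only the hints that point at authorities the search engine trusts."""
    return [hint for hint in affirmation_hints(xml_text) if hint[0] in trusted]


print(affirmation_hints(SAMPLE))
print(trusted_hints(SAMPLE, TRUSTED))
```

The hints are only pointers supplied by the author. The search engine still decides which authorities count, so listing an untrusted authority gains the author nothing.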
The assumption is that search engines could indicate affirmation of various sorts with simple flags, and perhaps use
it as a filtering mechanism. It would be good, though, for the choice of authorities and how their data is used to be a simple
switch on the search page, much as filters are used in other parts of the searching. For example, Google currently lets you
filter for language, both type -- English, Spanish -- and "explicit". When you are doing a deep search for a needle in a haystack,
you sometimes want to turn on "show all" once you've narrowed your search down fine enough.
There is a lot of experience on the Internet with various types of rating systems. Because SMBmeta is so open and deals
with so common an interest, it lends itself to distributed, volunteer ratings, as well as more explicit, corporate-generated
ones. Unlike the self-policing systems used against email spam, which wield the blunt instrument of blocking entire huge email domains because
of the perceived transgressions of a few who may be outsiders, SMBmeta authorities can be much more fine-grained.
There is no SMBmeta equivalent of Hotmail or Yahoo! Mail domains. Domains that use SMBmeta are much more likely to be
single businesses with single points of failure. Proof of compliance is much easier, and finding out that your domain is blocked
(and by whom) is much easier. Part of the value added by the search engines is in integrating all the available data. Unlike swamped
email-providing ISPs, search engine operators have interests that are much more aligned with doing this.
How does SMBmeta deal with the spam problems?
Shotgun Spam: The way to Shotgun Spam SMBmeta is to appear to be a legitimate business in a category, but link off to
a web site that is totally unrelated. The inexpensive way to do this is to set up a domain with an SMBmeta file that claims
the properties of a popular search target, such as a restaurant in a major city with a great location, hours, etc. It is only
upon clicking through to the web site (where detailed contact information should be found) that the searcher would realize
that they were misled. Of course, in such popular situations, there should be much more corroborating data for the search
engines to help disqualify the listing. In this case, someone figuring out that they are being misled should happen frequently
enough (and be easy enough to check) for simple reader-rating reporting to work. Our experience at Trellix with patrolling
for inappropriate pictures and other material on personal web sites can help here. Unlike that situation, it is much more
expensive and difficult to suddenly appear and get lots of traffic with SMBmeta, and the reaction time to inappropriate material
can be just as quick (in minutes or hours) with not too much manual effort. For example, new entries can be given out sparingly
in broad searches until there has been time for feedback to get into the system. A more specific search (which the web site
owner would do to check that they posted correctly) could still succeed.
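One simple way to picture the "sparingly at first" idea is a probation period. The Python sketch below is a simplification: it hides new entries from broad searches entirely instead of merely showing them less often, and the seven-day period and the function name are assumptions.

```python
from datetime import datetime, timedelta

PROBATION = timedelta(days=7)  # assumed length of the feedback window


def visible_in_results(first_seen, now, is_broad_search, flagged=False):
    """Decide whether a domain's entry appears in a given search.

    Flagged entries never appear. Specific searches (such as the owner
    checking their own posting) always succeed. Broad searches only
    include entries that have been known for at least PROBATION.
    """
    if flagged:
        return False
    if not is_broad_search:
        return True
    return now - first_seen >= PROBATION


now = datetime(2003, 1, 10)
new_entry = datetime(2003, 1, 9)
old_entry = datetime(2003, 1, 1)

print(visible_in_results(new_entry, now, is_broad_search=True))   # False
print(visible_in_results(new_entry, now, is_broad_search=False))  # True
print(visible_in_results(old_entry, now, is_broad_search=True))   # True
```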
The more expensive approach for Shotgun Spam is to obtain many domains, each with its own SMBmeta file, to appear to
be in a large number of specific search targets, so as to be one of the few results in each of a large number of narrow searches
that may not have as much corroborating data. Unlike Keywords meta tags, where one can easily list hundreds of keywords on
each of hundreds of pages, all redirecting to the same place, there is a real cost, probably well in excess of $10 per year,
for each potential search. To be local to thousands of postal codes, with hundreds of types of businesses, can be very expensive
since you would need a separate domain for each few postal codes and business types and the descriptions have to sound real,
too. Putting all those postal codes in a single SMBmeta file would be a flag in itself that could cause disqualification.
If you are found out and the domain is flagged by a single affirmation service, all search engines may choose to block that
domain and you have to start anew. Bulk domain registrations from frequent spammers may be easier to track down. The economics
are nowhere near as good as other techniques.
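The arithmetic behind that claim can be sketched in Python. The $10 per domain per year figure is the rough lower bound mentioned above; the numbers of postal codes and business types, and how many of each a single domain could plausibly claim, are assumptions picked only to show the scale.

```python
import math


def yearly_domain_cost(postal_codes, business_types,
                       codes_per_domain, types_per_domain,
                       cost_per_domain=10.0):
    """Return (domains needed, yearly cost) to cover every code/type pair.

    Assumes each domain can plausibly claim only a few postal codes and a
    few business types before the file itself looks suspicious.
    """
    domains = (math.ceil(postal_codes / codes_per_domain)
               * math.ceil(business_types / types_per_domain))
    return domains, domains * cost_per_domain


# Assumed scenario: 1,000 postal codes, 100 business types, 5 of each per domain.
domains, cost = yearly_domain_cost(1000, 100, 5, 5)
print(domains, cost)  # 4000 40000.0
```

Even under these generous assumptions the spammer needs thousands of registrations, each with a believable description, and each one can be lost to a single flag.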
Search Engine Optimization Spam: This is harder. It is similar to any false information on a web page, and similar to
any false advertising. Presentation of the data can make exaggerated claims more obvious, when most others are giving factual
data. False claims, such as long hours of operation or non-existent locations, will come back to haunt a business, as now webloggers
and others will have a reason to link to the web site of the heretofore unknown business, but with a negative review. The
search engines are likely to pick such things up. Blatant misrepresentation, as in any advertising, can lead to legal problems.
It is likely that most businesses will periodically check what you would find about them and their competitors on the Web,
and react strongly to competitors that try to misrepresent things. Unlike Keywords meta tags, though, misuse of SMBmeta should
be very obvious because the data will be presented for all to see, and the domain owner is a matter of record.
Conclusion
The fact that the old, discredited Keywords meta tags and SMBmeta both have meta data is not a reason for SMBmeta to
have the same fate when it comes to spam. The situations are quite different. In fact, aspects of the SMBmeta data can make
searching for brick and mortar businesses much more trustworthy than it is today.