TrellixTech.com
SMBmeta and Spam

SMBmeta is different than the old Keywords meta tags

Introduction
 
One of the most common comments about the SMBmeta Initiative is something to the effect of: "We've had meta-tags before and spammers abused them so they are no good." (For example, see the News.com article about SMBmeta.) The purpose of this essay is to explore the spam issue and see how it relates to the specifics of SMBmeta.
 
Search engines use various algorithms to determine relevance of a particular page to a search. Those algorithms are supposed to fairly rank results based on assumptions about what "normal" good web sites would do, with an eye towards looking out for cheaters. Let's discuss what "spam" is in the context of search results.
 
There are two types of search engine spam, as I understand it. The first type is when a web site tries to show up in searches for which it is not appropriate, in order to draw random traffic to the site. This is much like shotgun email spam, which hopes that somebody among the millions of recipients will find it of interest. A very problematic example is pornography web sites that attempt to get traffic from unsuspecting people looking for something else (e.g., the porn whitehouse.com vs. the official whitehouse.gov). I'll call this "Shotgun Spam".
 
The second type of spam is when a web site tries to get preferred position in the search results. This ranges from making sure that a web site is not treated worse than its peers due to some oversight on the author's part (such as not using words related to its subject on a home page or in its title), to trying to get extra benefit by packing in key words in text only seen by search engines and creating networks of linking sites to appear popular.  I'll call overzealous use of such techniques "Search Engine Optimization (SEO) Spam". Making sure that a web site does what the search engine expects and avoids the appearance of "bad" behavior that would inappropriately penalize it is a good thing, so not all SEO is bad, just SEO Spam that knowingly and purposely games the system.
 
Why is spam bad? It is bad because if it succeeds in inappropriately biasing results, searchers learn that they can't trust those results and will depend upon the Internet less. The success of Google and others in finding relevant pages in response to a query has been cited as being a major reason for the success of the Internet. According to the Pew Internet Project's Counting on the Internet report, such web sites "...help deliver on expectations and...have become, for many Internet users, trusted online sources and tools." The search engines value this trust highly and will gravitate towards ranking techniques that help meet, and raise, those expectations. They abandon techniques that result in frequent obvious "surprises" that lower the trust in their ability to rank results.
 
The old Keywords meta tag
 
For a few years, search engines were programmed to consider the contents of an HTML tag in the form of <meta name="keywords" content="word1, word2, etc.">. Web site authors could put this tag in the heading portion of a web page's source, along with other tags such as <title>title text</title>. The idea was that these keywords were hints to the search engine indicating topics for which the page was relevant. Since early search engines merely indexed words, this was a manual means to pull out the "key" words and associate terms of art ("Chinese food") with the specific words on the page ("moo shi").
 
Simplistic algorithms, such as frequency of use of a word, were used, so having a Keywords meta tag with the same word repeated could "bump" your position. The keywords were for all intents and purposes only visible to the search engines. When you saw a search result, you were not told that a page was shown because of a particular keyword. All this led to the ability to do both Shotgun Spam and SEO Spam without the searcher knowing it was happening. Search engines found the spam so prevalent, and their other techniques for searching so able to make up for the loss of benefit from those keywords, that they pretty much dropped looking at Keywords meta tags. (See Revisiting Meta Tags in SearchEngineWatch.) A few apparently still use them when there are no other ways of finding a requested search term using their normal techniques.
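 
To make the repetition problem concrete, here is a minimal sketch of a frequency-based scorer of the kind described above. It is not the algorithm of any particular search engine; the function names, the sample keyword strings, and the example domains are all made up for illustration.

```python
from collections import Counter


def keyword_score(keywords_content: str, query: str) -> int:
    """Naive score: how many times the query appears in a Keywords meta value."""
    terms = [t.strip().lower() for t in keywords_content.split(",") if t.strip()]
    return Counter(terms)[query.lower()]


def deduped_score(keywords_content: str, query: str) -> int:
    """Repetition-proof variant: a term counts once no matter how often it is repeated."""
    terms = {t.strip().lower() for t in keywords_content.split(",") if t.strip()}
    return 1 if query.lower() in terms else 0


# Hypothetical Keywords meta tag contents for two pages.
honest = "chinese food, moo shi, takeout"
stuffed = ", ".join(["chinese food"] * 50)

pages = {"honest.example": honest, "stuffed.example": stuffed}

# With the naive scorer, the page that repeats the term 50 times ranks first.
ranked = sorted(pages, key=lambda d: keyword_score(pages[d], "chinese food"), reverse=True)
print(ranked)
```

With the naive scorer the stuffed page wins easily, and nothing in the result tells the searcher why. The deduplicated variant gives both pages the same score, which is one simple way of not being biased by repetition.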
 
Other header meta tags, such as the <meta name="description" content="text"> tags, and the <title> tag, are used by many of the search engines, including Google. The difference here seems to be that those are shown to the searcher in the response, and in the case of the <title> tag, normally visible on a page to a browser. The act of gaming this system is more readily visible to the searcher, so they can choose to ignore results.
 
Multiple pages and third party validation
 
Another problem with the Keywords meta tags was that they were on a per-page basis, and only written by the author. A web site author could inexpensively create hundreds of pages, each with its own set of meta tags, to appeal to multiple audiences. External validation was hard, because if you found a problem with one page, it didn't necessarily help you with invalidating what you found on another.
 
With its "PageRank" technology, Google used the links from other web sites to the page being indexed as a way of determining its value and topic. This added third-party validation of some sort, though authors could initially game that system, too, by building complex webs of interrelated web sites.
 
The goals of traditional search
 
With the vastness of the material on the Web, the goal of most searching seems to have been to find answers to questions by finding the most relevant web pages. The goal has not been to find all possible answers. The types of questions searchers ask are "Tell me about this disease I just found out I have", or "Tell me about this actor I just saw in a movie". Yahoo! even helps restrict the businesses in their directory by making it hard to get in and charging you to speed it up.
 
This is different than many directories we have come to depend upon in normal life, such as telephone directories ("white" and "yellow" pages). Until the advent of cell phones and other telecom competition, one could expect to find all people with telephones, and especially all businesses, listed. Only those that explicitly did not want to be listed were skipped, and they paid for avoiding that default of being listed.
 
When a searcher is trying to find a particular business, or find all of the possible businesses of a particular type from which to choose, the "no false positives" attitude of search engines can be a problem. If you limit your search narrowly enough, a "no false negatives" approach, with enough information shown to let the human searcher make knowledgeable decisions using those results, may be more of what people want. I may want to see all carpenters, not just "good" ones. I may be searching for "that one in Needham I used last year but don't remember."
 
What do we learn?
 
Some of the things we learn from the Keywords meta tag experience are the following:
  • Showing the searcher data used in doing the search can help the searcher determine if a result is relevant and ignore inappropriate ones
  • Find ways to have third party validation
  • Make it easy to flag abusers
  • Do not be biased by simple repetition
  • Some detailed queries can only be answered using data provided by the web site author
  • Authors will go to the trouble of providing meta data if they think it will help searchers
  • If it's easy to abuse, it will be abused
 
How might SMBmeta data be used?
 
SMBmeta files have a few different types of data. There are attributes, like "language" and "type", that are chosen from very specific lists of possibilities. Other data, such as the body text of the <description> element, is free-form. Finally, there are some Internet-specific values, such as domain names and URLs.
 
In the initial design of SMBmeta, it was assumed that most of the data in the file would be displayed to the searcher as part of the results (though not all at once necessarily, depending upon the type of search). This is to help you choose the web site to look at for further information. Since the fixed attribute values can be confining, the text provided could be shown to provide more complete description, but the values themselves were assumed to be the thing mainly used for the selection. If the fixed values were not sufficient for a particular type of search, then the free-form text could be used, too, just as free-form text on a web page is used. It would be clear to the searcher reading the results which data was authored by the business and not vouched for by the search engine.
 
What this means is that the SMBmeta information would be used the same way as other web site author-provided data, like the title, the description, and the text surrounding the searched-for words. It would be shown to the searcher in a manner that made the origin clear. Unlike the Keywords meta data, it would not be hidden. In fact, one use of SMBmeta information could be to indicate the business name associated with a web site for normal search listings of all types, especially non-home pages. Instead of just seeing www.xyz.com/foo/bar.html you could also see a name and perhaps a geographical indication. Web domains for which this would be inappropriate would not have SMBmeta files.
 
Since inappropriate repetition could be used to bias results, such as claiming to be in all 700+ business type categories, it was assumed that search results would either indicate such behavior (by listing all the categories claimed or at least the number if there were more than a few) or explicitly block any domains that were clearly using too many. The specification explicitly states that descriptions may be truncated in search result listings, so they should be short. Any attempt to provide too much could just be ignored.
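 
As a rough illustration of the presentation rules just described, the sketch below lists the categories when there are only a few, shows a count when there are more, blocks a listing that claims far too many, and truncates long descriptions. The thresholds, the Listing class, and the output format are assumptions made for this example; none of them come from the SMBmeta specification.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed thresholds, chosen only for illustration.
LIST_LIMIT = 3               # up to this many categories are listed by name
BLOCK_LIMIT = 10             # more than this many and the domain is blocked
MAX_DESCRIPTION_CHARS = 80   # longer descriptions are truncated


@dataclass
class Listing:
    """Hypothetical in-memory form of the data read from one SMBmeta file."""
    domain: str
    categories: List[str]
    description: str


def render(listing: Listing) -> Optional[str]:
    """Return one line of search-result text, or None if the listing is blocked."""
    count = len(listing.categories)
    if count > BLOCK_LIMIT:
        return None  # clearly claiming too many business types
    if count <= LIST_LIMIT:
        cats = ", ".join(listing.categories)
    else:
        cats = f"{count} categories claimed"  # make the behavior visible
    desc = listing.description
    if len(desc) > MAX_DESCRIPTION_CHARS:
        desc = desc[:MAX_DESCRIPTION_CHARS].rstrip() + "..."
    return f"{listing.domain} [{cats}] {desc}"


print(render(Listing("a.example", ["carpenter"], "Custom cabinets and repairs")))
print(render(Listing("b.example", [f"type{i}" for i in range(700)], "Everything!")))
```

The point of the design is that over-claiming is either shown to the searcher or ignored outright, so it buys the author nothing.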
 
What is different about SMBmeta data from normal web page data?
 
One of the special requirements of SMBmeta files is that there is only one per second-level domain (e.g., "yourname.com"). That domain name can then be used as a simple key to many databases for checking uniqueness, authenticity, etc. This is unlike most data used by search engines that are concerned with individual web pages. SMBmeta is concerned with single small businesses and their entire web site. With only 20 million or so businesses in a country like the USA, validation databases can easily be searched, and probably fit in RAM. The data is designed to be relatively static and small, so such databases can be created slowly and incrementally, and shared with others.
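 
A minimal sketch of the "domain name as key" idea follows, assuming a simple in-memory dictionary as the database. The helper names and URLs are hypothetical, and taking the last two labels of the host name is a deliberate simplification that would be wrong for registries such as "co.uk".

```python
from typing import Dict
from urllib.parse import urlsplit


def domain_key(url: str) -> str:
    """Reduce a URL to its second-level domain, e.g. 'yourname.com'.

    Simplification: keeps the last two labels of the host name, which is
    not correct for suffixes like 'co.uk'.
    """
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


# One record per business domain; the record contents are made up.
registry: Dict[str, dict] = {}


def record_spidered(url: str, data: dict) -> None:
    """Store what was found, replacing anything stored earlier for the domain."""
    registry[domain_key(url)] = data


record_spidered("http://www.yourname.com/smbmeta.xml", {"type": "carpenter"})
record_spidered("http://yourname.com/smbmeta.xml", {"type": "restaurant"})
print(registry)
```

Because both URLs reduce to the same key, the second visit replaces the first rather than adding a second entry, so there is only ever one set of claims per domain to validate.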
 
One problem with web pages that trips up some search engines is that the data they use and the data customers see can be different. This technique of serving up different data to the search engine's spiders than to people with browsers is called "cloaking". Unfortunately for spammers, some of the main uses of SMBmeta data can be done with just the data found when spidering. The uncloaked data may never be seen. If different data is shown at different times, the database nature of storage, together with the single domain name "key", would overwrite old data leaving only a single set for searching.
 
Affirmation
 
A common question about data on the Internet, and SMBmeta data in particular, is how do you know it's true? Is there some sort of authority affirming something about the file?
 
The design of SMBmeta, with the domain name "key", is built around making it easy to run such "authorities". Any number of databases can list any or all SMBmeta files they know about and indicate whatever they want about them. It does not require the cooperation of the business web site owner. What it does require, though, is the cooperation of the search engine operator to check an authority.
 
What types of things would one want to be affirmed by a third party? Some obvious examples are:
  • The smbmeta.xml file exists and is in valid format
  • The web site associated with the file corresponds to the information claimed in the file (e.g., no Shotgun Spam)
  • The business named in the file and on the web site corresponds to the claims in both (e.g., no Search Engine Optimization Spam)
 
To aid this important function, a new element is being added to the original SMBmeta proposal: the <affirmation href="URI" signature="value"> element. This lets the SMBmeta author give hints to a search engine about where it might find the domain listed. If the search engine considers that "authority" authoritative, then it could check on the domain, or use the optional signature for authenticity. The element is optional and there can be more than one. Affirmation Authorities will be covered in another essay.
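 
Here is a hedged sketch of how a search engine might read those hints. Only the <affirmation> element with its href and signature attributes, and the <description> element, come from the text above; the root element name, the sample values, and the helper functions are placeholders invented for this example and should not be read as the actual SMBmeta schema.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Made-up sample file. The root element name is a placeholder.
SAMPLE = """<smbmeta>
  <description>Neighborhood carpentry shop</description>
  <affirmation href="http://authority-one.example/check" signature="abc123"/>
  <affirmation href="http://authority-two.example/check"/>
</smbmeta>"""


def affirmation_hints(xml_text):
    """Return (href, signature) pairs; signature is None when it is omitted."""
    root = ET.fromstring(xml_text)
    return [(el.get("href"), el.get("signature")) for el in root.iter("affirmation")]


def hints_to_check(xml_text, trusted_hosts):
    """Keep only the hints that point at authorities this search engine trusts."""
    return [
        href
        for href, _signature in affirmation_hints(xml_text)
        if href and urlsplit(href).hostname in trusted_hosts
    ]


print(affirmation_hints(SAMPLE))
print(hints_to_check(SAMPLE, {"authority-one.example"}))
```

Note that the hints are only suggestions from the author. The decision about which authorities count stays with the search engine, which is why the trusted set is supplied by the caller rather than read from the file.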
 
The assumption is that search engines could indicate affirmation of various sorts with simple flags, and perhaps use it as a filtering mechanism. It would be good, though, for the choice of authorities and how their data is used to be a simple switch on the search page, much as filters are used in other parts of searching. For example, Google currently lets you filter on "language" in both senses: the language a page is written in (English, Spanish) and "explicit" language. When you are doing a deep search for a needle in a haystack, you sometimes want to turn on "show all" once you've narrowed your search down finely enough.
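 
The "simple switch" could look something like the sketch below. The function name, the data shapes, and the authority names are assumptions for illustration only; a real search engine would have its own result and rating structures.

```python
from typing import Dict, List, Optional, Set, Tuple


def apply_affirmation_filter(
    domains: List[str],
    affirmed_by: Dict[str, Set[str]],
    authority: Optional[str] = None,
    show_all: bool = False,
) -> List[Tuple[str, bool]]:
    """Return (domain, affirmed) pairs for display.

    affirmed_by maps an authority name to the set of domains it affirms.
    With no authority chosen, or with show_all on, every result is kept
    and the flag is informational only. Otherwise unaffirmed domains
    are filtered out.
    """
    affirmed = affirmed_by.get(authority, set()) if authority else set()
    flagged = [(d, d in affirmed) for d in domains]
    if show_all or authority is None:
        return flagged
    return [pair for pair in flagged if pair[1]]


results = ["good-carpenter.example", "unknown-carpenter.example"]
ratings = {"volunteer-list": {"good-carpenter.example"}}

print(apply_affirmation_filter(results, ratings, authority="volunteer-list"))
print(apply_affirmation_filter(results, ratings, authority="volunteer-list", show_all=True))
```

Keeping the flag in the "show all" output means the searcher still sees which results were affirmed, even when nothing is being hidden.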
 
There is a lot of experience on the Internet with various types of rating systems. Because SMBmeta is so open and deals with so common an interest, it lends itself to distributed, volunteer ratings, as well as more explicit, corporate-generated ones. Unlike the self-policing systems used against email spam, which wield the blunt instrument of blocking entire huge domains because of the perceived transgressions of a few (who may be outsiders), SMBmeta authorities can be much more fine-grained. There is no SMBmeta equivalent of Hotmail or Yahoo! Mail domains. Domains that use SMBmeta are much more likely to be single businesses with single points of failure. Proof of compliance is much easier, and finding out that your domain is blocked (and by whom) is much easier. One value the search engines add is in integrating all the available data. Unlike swamped email-providing ISPs, search engine operators are much more aligned with doing this.
 
How does SMBmeta deal with the spam problems?
 
Shotgun Spam: The way to Shotgun Spam SMBmeta is to appear to be a legitimate business in a category, but link off to a web site that is totally unrelated. The inexpensive way to do this is to set up a domain with an SMBmeta file that claims the properties of a popular search target, such as a restaurant in a major city with great location, hours, etc. It is only upon clicking through to the web site (where detailed contact information should be found) that the searcher would realize that they were misled. Of course, in such popular situations, there should be much more corroborating data for the search engines to help disqualify the listing. In this case, someone figuring out that they are being misled should happen frequently enough (and be easy enough to check) for simple reader-rating reporting to work. Our experience at Trellix with patrolling for inappropriate pictures and other material on personal web sites can help here. Unlike that situation, it is much more expensive and difficult to suddenly appear and get lots of traffic with SMBmeta, and the reaction time to inappropriate material can be just as quick (in minutes or hours) with not too much manual effort. For example, new entries can be given out sparingly in broad searches until there has been time for feedback to get into the system. A more specific search (which the web site owner would do to check that they posted correctly) could still succeed.
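 
One way to picture "given out sparingly" is a probation rule like the sketch below. The probation length, the exposure rate, the complaint limit, and the function itself are all assumptions made up for this example, not part of SMBmeta or of any search engine's practice.

```python
from datetime import datetime, timedelta

# Assumed policy values, for illustration only.
PROBATION = timedelta(days=7)   # time allowed for reader feedback to arrive
NEW_ENTRY_RATE = 0.1            # share of broad searches that show a new entry
COMPLAINT_LIMIT = 3             # reader reports that disqualify a listing


def show_in_results(first_seen: datetime, now: datetime, broad_search: bool,
                    roll: float, complaints: int = 0) -> bool:
    """Decide whether a listing appears in one search.

    roll is a number in [0, 1) supplied by the caller (for example from
    random.random()), so the rule itself stays deterministic and testable.
    """
    if complaints >= COMPLAINT_LIMIT:
        return False                      # flagged by reader reports
    if not broad_search:
        return True                       # specific searches always succeed
    if now - first_seen >= PROBATION:
        return True                       # established entry
    return roll < NEW_ENTRY_RATE          # new entry: limited exposure


now = datetime(2003, 1, 24)
new_entry = now - timedelta(days=1)
print(show_in_results(new_entry, now, broad_search=True, roll=0.5))
print(show_in_results(new_entry, now, broad_search=False, roll=0.5))
```

A brand-new domain therefore cannot collect heavy traffic from popular searches before anyone has had a chance to report it, while its owner can still confirm the listing with a narrow search.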
 
The more expensive approach for Shotgun Spam is to obtain many domains, each with their own SMBmeta file, to appear to be in a large number of specific search targets, so as to be one of the few results in the large number of narrow searches that may not have as much corroborating data. Unlike Keywords meta tags, where one can easily list hundreds of keywords on each of hundreds of pages, all redirecting to the same place, there is a real cost, probably well in excess of $10 per year, for each potential search. To be local to thousands of postal codes, with hundreds of types of businesses, can be very expensive since you would need a separate domain for each few postal codes and business types and the descriptions have to sound real, too. Putting all those postal codes in a single SMBmeta file would be a flag in itself that could cause disqualification. If you are found out and the domain is flagged by a single affirmation service, all search engines may choose to block that domain and you have to start anew. Bulk domain registrations from frequent spammers may be easier to track down. The economics are nowhere near as good as other techniques.
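 
The arithmetic behind that cost argument can be worked through with assumed numbers. The sketch below takes 1,000 postal codes, 100 business types, three of each per domain, and the $10 per domain per year figure as a lower bound. Only the $10 figure comes from the text; the other inputs are picked purely to show the scale.

```python
import math


def yearly_domain_cost(postal_codes: int, business_types: int,
                       codes_per_domain: int, types_per_domain: int,
                       cost_per_domain: float = 10.0):
    """Return (domains needed, yearly cost) to cover every code/type pair.

    Each domain is assumed to claim only a few postal codes and a few
    business types, since claiming more would itself be a flag.
    """
    domains = (math.ceil(postal_codes / codes_per_domain)
               * math.ceil(business_types / types_per_domain))
    return domains, domains * cost_per_domain


domains, cost = yearly_domain_cost(1000, 100, 3, 3)
print(domains, cost)  # 11356 domains, 113560.0 dollars per year
```

Even at the minimum price, covering this modest grid takes over eleven thousand registrations and more than $100,000 a year, before counting the effort of writing plausible descriptions for each one.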
 
Search Engine Optimization Spam: This is harder. It is similar to any false information on a web page, and similar to any false advertising. Presentation of the data can make exaggerated claims more obvious, when most others are giving factual data. False claims, such as long hours of operation or non-existent locations, will come back to haunt a business, as now webloggers and others will have a reason to link to the web site of the heretofore unknown business, but with a negative review. The search engines are likely to pick such things up. Blatant misrepresentation, as in any advertising, can lead to legal problems. It is likely that most businesses will periodically check what you would find about them and their competitors on the Web, and react strongly to competitors that try to misrepresent things. Unlike Keywords meta tags, though, misuse of SMBmeta should be very obvious because the data will be presented for all to see, and the domain owner is a matter of record.
 
Conclusion
 
The fact that the old, discredited Keywords meta tags and SMBmeta both have meta data is not a reason for SMBmeta to have the same fate when it comes to spam. The situations are quite different. In fact, aspects of the SMBmeta data can make searching for brick and mortar businesses much more trustworthy than it is today.

Last modified: 24 January 2003 4:04 by DSB

(c) Copyright 2002-2003 Interland, Inc. All Rights Reserved.