On Ontology, Tagging, Search, & Commerce
The best and most relevant (to me) web dissertation I’ve ever read is Clay Shirky’s Ontology is Overrated. I do not hope to come even close to the clarity and relevance of that manifesto, but I hope to add to the discussion in three ways. First, I take a narrower (commerce) rather than wider (information retrieval) focus, from the perspective of a specific application. Second, I want to take a historical and broader view of the topic within the context of e-commerce. Lastly, while there is no ultimate winner (the game is not over) and no single right way to architect an “e-commerce information retrieval system,” to date there has been a winning methodology, as proven by revenue, profits, and even market cap. How the pendulum will swing in the future, I don’t know, but recent technology improvements have certainly allowed the various architectures to compensate for each other’s shortcomings.
(BTW, I’m using ontology/taxonomy/attributes as a generalization of any structured content. That’s not technically correct, but it’s useful in this case.)

At the two ends of the spectrum of e-commerce product retrieval systems are:
1. Search Engine + Unstructured Content - Product information is created by the product owner (seller, distributor, manufacturer, etc.) in an ad hoc manner with minimal regard for standardization or formatting. A search engine is used to find relevant products for buyers based on various algorithms (keywords, PageRank, etc.).
2. Query + Structured Content - The best way to think about this is an attributed query field and an attributed catalog: essentially a SQL database with a structured query interface.
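To make the two ends concrete, here is a minimal Python sketch. Everything in it is hypothetical: the sample listings, the catalog fields (`brand`, `model`, `capacity_gb`), and the function names are made up for illustration and are not how any of the companies discussed here actually implement search.

```python
# Unstructured end: free-text listings written however the seller likes.
listings = [
    "Apple iPod mini 4GB silver barely used",
    "ipod 20 gb w/ case and charger",
    "Sony Walkman cassette player vintage",
]

# Structured end: every product described with the same attributes.
catalog = [
    {"brand": "Apple", "model": "iPod mini", "capacity_gb": 4},
    {"brand": "Apple", "model": "iPod", "capacity_gb": 20},
    {"brand": "Sony", "model": "Walkman", "capacity_gb": None},
]


def keyword_search(query, docs):
    """Return the docs that contain every query word (case-insensitive)."""
    terms = query.lower().split()
    return [d for d in docs if all(t in d.lower().split() for t in terms)]


def attribute_query(records, **criteria):
    """Return the records whose attributes equal every criterion,
    roughly what a SQL WHERE clause with ANDed equality tests does."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


# The buyer types a word and gets anything that mentions it.
keyword_hits = keyword_search("ipod", listings)

# The buyer fills in attribute fields and gets exact matches only.
attribute_hits = attribute_query(catalog, brand="Apple", capacity_gb=20)
```

The keyword search needs nothing from the seller beyond some text, while the attribute query only works if somebody filled in `brand` and `capacity_gb` consistently for every product.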
There are several examples along the spectrum, running from the unstructured (left) end to the structured (right) end:
Google - In the purest sense, Google (not Froogle) is the perfect implementation of the first kind of system: completely unstructured data plus a search engine.
eBay - In the SKU-less world of eBay (circa 2000), sellers enter product information in a semi-structured manner. Furthermore, there is no effort to consolidate listings with the same SKU into one giant listing. As far as eBay or any machine is concerned, each product listed for sale is completely unique. A search engine is implemented to search listing titles and sometimes descriptions.
Delicious - There is really no “tag” implementation of an e-commerce search, so I’m just gonna let Delicious be my straw man. Some might argue that Delicious and eBay should switch places. I, however, would argue that the act of tagging a product with a set of specific tags is more restricting, and thus more structured, than eBay’s “Listing Title.” Furthermore, as you’ll see later, eBay and Delicious make a Recall/Precision trade-off consistent with the rest of the spectrum. (BTW, eBay does have a categorization scheme, but not in the context of its search engine; the scheme essentially offers an alternative method of navigation. If you want to, you can switch eBay and Delicious on the spectrum because of this issue.)
Amazon - Amazon has a catalog that is SKU-centric, in that the product title and description are standardized for each unique product. Sellers of a product have to list it under that product’s SKU.
Chemdex - A long-dead but very relevant example that in many ways represents all B2B e-commerce companies back in 2000. Like a lot of B2B implementations of an e-commerce information retrieval system (aka a catalog), Chemdex had very sophisticated, highly attributed, highly structured product content. It was the very definition of an ontology or taxonomy (depending on your own interpretation of the word).
As we all know, the companies that made the critical product strategy decision to sit on the LEFT side of the spectrum, with unstructured content, have become the dominant players in the e-commerce world. For various reasons I will go into, Google and eBay have garnered a disproportionate amount of the e-commerce spend. Especially in the case of eBay vs. Amazon, the power of unstructured content has won over rigid standards. Many would argue that eBay has a much better business model (no inventory) than Amazon and is the leading player for that reason. I would argue that because Amazon has also adopted this virtual model since 2000 and has yet to narrow the gap, it is actually the superior product architecture that is the driving force behind eBay’s growth. Fundamentally, it is also this unstructured product content architecture that has allowed eBay to maximize its virtual model, and it is thus the true source of eBay’s competitive advantage.
There are several key differences across the spectrum:
1. Sophistication/Effort - On one end, the critical product and differentiation factor is a better search algorithm; on the other end, the critical factor is content creation. Essentially, players on the left side of the spectrum decided to spend money on “understanding the mess,” while those on the right spend it on “cleaning up the mess.”
2. SKU - Due precisely to the decision above, adding and creating content is so easy for players on the left that an unlimited number of products can be sold and managed, leading to breadth. On the right, because the bottleneck of the commerce system is the creation of the catalog, companies are forced to focus on the products they can sell and drive inventory turnover for those SKUs (i.e. depth).
3. Investment - Equally important, Google and eBay live and die by the “power” of their search engines and thus spend significant money on creating best-of-breed algorithms and user experiences. Chemdex and Amazon, on the other hand, must invest in content creation throughput, usually in the form of manpower. (Chemdex spent a disproportionate amount of its money on this task and eventually went out of business because content creation took so long, the quality was so bad, and it was so expensive.)
There are also some key trade-offs:
1. Speed - Search engines are generally faster than query engines: a SQL query over 100X less data can still come back slower than a Google search. This has serious effects on the user experience, especially in B2B.
2. Precision - A key search engine concept, connoting the “relevancy” of the individual returned results. Structured content typically returns more precise results because more attributes and parameters can be specified by the “buyer.”
3. Recall - Another key search engine concept, connoting the “coverage” of the returned results: regardless of the number of results returned, as long as all relevant results are included, the search has good recall. Unstructured content typically has high recall due to the “fuzziness” and flexibility of its algorithms. Structured content, on the other hand, has serious recall issues, as mentioned by Clay Shirky.
4. Flexibility - This is THE key reason that unstructured content won over structured. The flexibility to sell ANYTHING (a kidney on eBay!) allowed eBay to evolve without management interference, while Amazon required the creation of new content and new categories.
5. Data Mining - On the other hand, the ability to understand data through structured content is the key competitive differentiator that Amazon has over eBay or Google. It can mine data extensively to create sophisticated cross-selling, up-selling, recommendation, and personalization features that Google will be hard pressed to implement because its data is “dumb.” While this has always been Amazon’s strategy, it was still not enough to overcome the rigidity of its product catalog architecture.
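The precision and recall items in the list above are easier to see with numbers. Below is a small Python sketch using the standard definitions (precision = relevant results returned / all results returned; recall = relevant results returned / all relevant items). The product IDs and the two result sets are invented purely for illustration.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for one search.

    returned: the items the system showed the buyer
    relevant: the items the buyer would actually have wanted
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four products are truly relevant to the buyer.
relevant = {"A", "B", "C", "D"}

# A "fuzzy" keyword search finds all four, plus four irrelevant listings.
fuzzy_results = ["A", "B", "C", "D", "W", "X", "Y", "Z"]

# A strict attribute query returns only two items, both relevant.
strict_results = ["A", "B"]

fuzzy = precision_recall(fuzzy_results, relevant)    # (0.5, 1.0)
strict = precision_recall(strict_results, relevant)  # (1.0, 0.5)
```

In this made-up case the fuzzy search misses nothing but makes the buyer wade through noise, while the strict query is clean but silently hides half of what the buyer wanted.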
These differences and trade-offs reflect choices made by the various players in the industry. To date, buyers have shown that a good search engine over an unstructured product information source is the superior architecture for an e-commerce focused information retrieval system. Thus intelligence has won over brute force. Oh ya, I too think ontology/taxonomy/attributes is overrated, not just philosophically but for business.
I believe the history of e-commerce search will have serious implications for the so-called SEMANTIC web, but I’ll save that for the next post when I can think more clearly. (Hint: I’m in Clay’s camp.)
Just some of the things I’ve read on tagging recently (there is a lot, BTW, so this is not comprehensive):
Unfolding Ontology from Alex
The Yin and Yan of Tagging
More Clay
More Clay on Tim Bray’s Q
A blog on tagging: You’re It
Fred’s Tags





I have read your article with great interest, but I have a few comments, which I hope will be welcome.
I see where you are going in comparing the two different paradigms: open search of unstructured data (i.e. Google) vs. the ability to find data based on its categorization or tagging (i.e. Amazon).
I think it’s a good comparison, open search vs. tagged categorization. But I see a little overlap, and here is where:
Google tags all of the web pages it is aware of. It does this through webcrawler bots that seek out all of the words (data elements) within a web page and, with the exception of very common words (the, a, an, of, etc.), store the knowledge about that page in a huge data repository. (It also grades, or weights, this data based on the number of pages linking to yours, the data elements of those pages, and so on.) When someone enters a search in Google, it runs a query against that database (performed on a high-performance computing cluster of thousands of computers, but essentially the same kind of SQL query that your small LAMP B2B server might perform), and based on the tagging that Google imposed, it returns weighted results. It is a query based on data tagging.
eBay is another interesting area: the data on eBay about items is not really the free search that it appears to be on the surface. The description words that you enter when you post an item for sale (as well as the categories, or tags, that you label the item with) are stored within a database and queried when a user searches for something.
In both cases (as well as Amazon, Yahoo, Delicious, Technorati, etc.) there is some digital marking going on.
Where I see the interesting thing happening with sites such as Google is this: rather than forcing you to fit your data into a strict ontology (or taxonomy), it employs a very agile, dynamic taxonomy which in essence comprises all the possible words that a human can post to a web page (with the exception of articles: the, a, an, etc.). This taxonomy is redefined and reformed dynamically every time the webcrawlers update the Google databases. More realistically, this is not a discrete event but closer to a continuous one, as the Google crawlers are always crawling.
In his original article on the Semantic Web, Tim Berners-Lee wrote that data would have to be digitally marked. Whether this is through tagging based on a formal domain upper ontology, or more dynamically (through data categorizers, crawling bots, folksonomies, etc.), doesn’t matter. The semantic web (which I believe is where we are going) is looking for some sort of tagging to give all the data out there some metadata. Where Clay Shirky is right is this: it doesn’t need to be the rigid view that some of the current ontology researchers believe it is (it can’t be).
Chuck
Comment by Chuck Turnitsa — July 1, 2005 @ 4:00 am
Chuck, thanks for the comment. I was looking at the various implementations from the angle of the end user experience rather than the back-end architecture. In the end, most of these systems do have a database at the back end that stores much of the data, and indexing (in its many variants) is used to speed up queries. Actually searching through each web page or product catalog word by word for every search request would take hours, if not days, which is completely infeasible for the use cases needed.
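For anyone curious what “indexing” means here, below is a toy inverted index in Python. It is only a sketch of the general idea, not a description of how Google or eBay actually build theirs: the stop-word list, the sample documents, and the function names are all assumptions made up for the example.

```python
from collections import defaultdict

# Assumed stop-word list; real systems use their own (or none at all).
STOP_WORDS = {"the", "a", "an", "of"}


def build_index(docs):
    """Map each non-stop word to the set of document IDs that contain it.
    This is done once, ahead of time, not on every search."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return index


def lookup(index, query):
    """Return IDs of documents containing every non-stop query word,
    using only index lookups instead of rescanning the documents."""
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical mini "web" of three pages.
docs = {
    1: "the history of the semantic web",
    2: "tagging and the semantic web",
    3: "a catalog of chemical products",
}

index = build_index(docs)
matches = lookup(index, "semantic web")  # documents 1 and 2
```

The expensive word-by-word pass happens once in `build_index`; each search after that is a handful of dictionary lookups and set intersections, which is why the user never waits for a full scan.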
Comment by Administrator — July 5, 2005 @ 10:08 pm