Content is a dirty business . . . more specifically, content that has commercial/transactional value (i.e. not entertainment or informational content). I had a taste of the business a few years back at my B2B startup, working to get industrial catalogs onto the site. After digging around the industry and meeting with different companies, I quickly realized it was an extremely incestuous circle of people who cycle from company to company in an endless merry-go-round. The hardest part is that every player in this particular industry accuses the others of various forms of copyright infringement, and it is next to impossible to figure out the real story.
The reason is that the law is extremely fuzzy on the actual ownership of the content. For example, if 10 manufacturers hand off their product specifications to a content normalization firm (which is commissioned to do the job by some other company), who really owns the content after all the work is done? Is it the manufacturers who created the content in the first place? The company that hired the content cleansing firm? Or the content normalization firm, because its value add is significant enough to warrant more than just derivative work rights? Or is it public domain? (How can anyone own the fact that a certain grade of steel piping only comes in 6 standard form factors?) Can the content normalization firm resell that content even though the work was done under some sort of consulting contract? If the firm is caught reselling the content to another company, how can anyone prove that it “repurposed” the content rather than “started from scratch”? Does it really matter whether the process is manual, done via software, or a combination of the two?
Even if the law were clear, it is next to impossible to prove any type of illegal practice. All it really takes is to burn a CD-ROM and walk it out the door. And in fact, I quickly discovered (unfortunately not early enough) that most of these players are really recycling the same content.
I also quickly discovered that, even more than the content itself, it is the so-called schemas (aka semantics, aka metadata, aka attributes) that are the most important “IP” in the industry. Once you know how products should be characterized (or how experts in the field define their products), the job of matching and normalizing content to that schema is significantly easier, if not trivial. Furthermore, these schemas are what drive the discoverability of your search engine and create true comparability between products. Having a system that is structured yet responsive to change is a competitive advantage (for example, if Apple releases a new 1000G iPod Nano today, I had better have those attributes defined ASAP or the discoverability of the iPod will suffer). So the harder, more important question is . . . Are the ownership rights to content separate from those to semantics? Or are they bundled? Is a schema, or semantics in general, even “ownable”?
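To make the "easier if not trivial" claim above a bit more concrete, here is a minimal sketch of what normalizing against a known schema can look like. Everything in it is made up for illustration: the `PIPE_SCHEMA` attribute names, the alias lists, and the two vendor records are assumptions, not anyone's real catalog data, and a real normalization job would also have to deal with units, typos, and free text.

```python
# Hypothetical schema for one product category:
# canonical attribute name -> aliases different manufacturers might use.
PIPE_SCHEMA = {
    "material": {"material", "mat", "grade"},
    "outer_diameter_mm": {"od", "outer diameter", "outside dia"},
    "wall_thickness_mm": {"wall", "wall thickness", "thk"},
}


def normalize(record, schema):
    """Map a raw manufacturer record onto the canonical schema attributes.

    Returns (normalized, unmatched): attributes that matched the schema,
    and leftovers that a human (or a schema update) still has to handle.
    """
    # Invert the schema so each alias points at its canonical name.
    alias_to_canonical = {
        alias: canonical
        for canonical, aliases in schema.items()
        for alias in aliases
    }
    normalized, unmatched = {}, {}
    for raw_name, value in record.items():
        canonical = alias_to_canonical.get(raw_name.strip().lower())
        if canonical is None:
            unmatched[raw_name] = value
        else:
            normalized[canonical] = value
    return normalized, unmatched


# Two made-up vendor records describing the same kind of product.
vendor_a = {"Mat": "A106 steel", "OD": "60.3", "Wall": "3.9"}
vendor_b = {"Grade": "A106 steel", "Outside Dia": "60.3", "Finish": "black"}
```

The code itself is a dictionary lookup; the hard-won part is the alias table, which is exactly the piece that encodes how experts in the field describe their products. Note also that `vendor_b`'s "Finish" falls into the unmatched pile until someone adds it to the schema, which is the "responsive to change" problem in miniature.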
So why the long setup? By now you have probably figured it out . . . instead of walking out the door with a CD-ROM, what if I created an RSS feed and happened to send it to GoogleBase? What happens there? Is someone violating some sort of copyright law? Maybe not for the content itself, but how about the schema? What if I owned the content but someone else built the semantics around the data? Can I export it and give it to someone else? Can that someone (GoogleBase) use it for their own benefit without notifying me? What if, instead of sitting on a CD-ROM, the data actually resides on web pages? And THAT is the issue we are facing today . . .
Today, vertical search engines are normalizing semantics around content that they do not own (similar to the content normalization firms of yesteryear). Sometimes these vertical search engines feed the content directly into GoogleBase; other times, Google simply indexes the content through the search engine. In both cases Google is taking not just the content but the schema, and repurposing it for its own search engine. Do the vertical search engines care? Do the content owners? Who owns the content? Who owns the schema? Does anyone have the right to stop Google? I don't have the answers . . . maybe someone does . . .
In the not too distant future (6 months?), once GoogleBase gains critical mass and the bell curve reaches steady state, Google will have a pretty good idea of what the attributes are for most of the transactional products and services (any “physical” or “metaphysical” objects, really) on the web, without lifting a finger, by virtue of its folksonomy-based name:value pair attribute engine. At that point, Google will be able to extract semantics out of the webpages it crawls without the help of vertical search engines or expensive manual design. When that happens, Google will be able to launch a thousand vertical search engines with the flip of a switch . . . a scary thought for vertical search engines that have invested countless man-hours in designing their attribute rules and believe those rules to be their barrier to entry . . .
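For the curious, here is a rough sketch of how a schema could fall out of nothing but user-submitted name:value pairs. To be clear, this is my own toy illustration, not a description of how Google actually does it: the `infer_schema` function, the 50% `min_share` cutoff, and the sample listings are all assumptions. The idea is simply that once enough people describe the same kind of item, the attribute names most of them agree on start to look like a schema.

```python
from collections import Counter, defaultdict


def infer_schema(listings, min_share=0.5):
    """Guess each category's schema from folksonomy-style submissions.

    listings: iterable of (category, {attribute_name: value}) pairs.
    An attribute name is kept for a category if at least `min_share`
    of that category's listings use it (case-insensitively).
    """
    totals = Counter()                  # listings seen per category
    attr_counts = defaultdict(Counter)  # category -> attribute name -> count
    for category, attrs in listings:
        totals[category] += 1
        # Fold case and whitespace so "Capacity" and "capacity " count once.
        for name in {n.strip().lower() for n in attrs}:
            attr_counts[category][name] += 1
    return {
        category: sorted(
            name
            for name, count in attr_counts[category].items()
            if count / totals[category] >= min_share
        )
        for category in totals
    }


# Made-up submissions, standing in for what users might type in.
listings = [
    ("mp3 player", {"Capacity": "4GB", "Color": "black"}),
    ("mp3 player", {"capacity": "8GB", "Battery Life": "24h"}),
    ("mp3 player", {"Capacity": "2GB", "color": "silver"}),
    ("steel pipe", {"Grade": "A106", "OD": "60.3mm"}),
]

schema = infer_schema(listings)
```

With these four listings, "capacity" and "color" survive for mp3 players while the one-off "battery life" is dropped. With only a handful of listings the result is noise, which is why critical mass matters so much in the argument above.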
This is GoogleBase: content is just a means to an end . . . and the end is the so-called schemas, semantics, metadata, and attributes . . . this is what I have come to realize . . . that content is simply too perishable to be valuable, but this other stuff with funny names . . . this is worth more than gold . . . this is the ticket to owning the semantic web . . .