The year is 1997. You’re wearing whatever people wore back then—some kind of jean jacket, I’m guessing—and talking to your friend about your new favorite movie, the recently released Mike Myers vehicle Austin Powers. You’re quoting the movie, and your friend thinks this is hilarious. Then things take a dark turn. “I thought Randy Quaid was excellent,” your friend says. “Randy Quaid?” you think, trying hard not to punch the wall. “Randy Quaid wasn’t in Austin Powers.” You try explaining this to your friend—“I believe,” you say tersely, “that you’re thinking of Clint Howard”—but your friend is adamant. To settle this dispute, and salvage what remains of your friendship, you boot up your 90-pound computer tower. Forty minutes later, you have made it onto the internet. The question now is: Where do you go? How, before Google, did people settle asinine disputes, and/or find other sorts of information? For this week’s Giz Asks, we reached out to a number of experts to find out.
Assistant Professor, Information, University of Texas at Austin, whose research is concerned with the emergence, standardization, and preservation of new information objects in mobile and social media platforms
Google Search dominates over 90% of a market that includes search engines like Yahoo, Bing, and privacy-driven DuckDuckGo. But before Google’s personalized, ad-driven search algorithm took over almost everything we can find on the web, there were website directories and indexed search engines that assembled web resources by topic.
The earliest web search engines were directories of websites curated by people. These web ontologists (Yahoo called them “surfers”) would read all the webpages about particular topics and then rank them. Eventually this human-driven model of categorization was replaced by crawling websites with bots (sometimes called spiders), then ranking websites by their reliability and relevance to different kinds of search queries. In the early 1990s there were about twenty different search engines to choose from, including WebCrawler, Lycos, AltaVista, and Yandex. Similar to library catalogs, these search engine indexes were compiled and organized by topic, content, structure and subject. Early search engines were designed so users could navigate to bundles of hyperlinked resources across different high-level categories like “News,” “Travel,” “Sports,” and “Business.” Columns of broad categories, crammed together and full of blue hyperlinks for users to choose from, made early search engine homepages look like the crowded index in the back of a textbook.
It’s important to remember that 1990s web searching had different goals and incentives for people “surfing the web.” In early online cultures, finding a fact or product wasn’t always the goal of searching. Instead, search engines helped people discover and explore digital resources and experience the world wide web. Web search in the 1990s had less ad-targeting and gave users more control to explore, even if the results were rudimentary and didn’t always reliably filter out porn. Compared to today’s search experiences, early web search was more of a questing experience. By quest, I mean taking an active role in navigating and discovering content, in ways that personalized, curated search from platforms like Google and Facebook have largely usurped with audience-targeted advertising. Let me give you an example of an adventurous early web search expedition. There was a time when searching for song lyrics for “Small Town Boy” could lead you to the first German fan page for Jimmy Somerville. These days, if you search for the song lyrics, Google will excerpt the lyrics from a website like LyricFind.com. When you move from a questing experience to a precise, algorithmic experience, search becomes routine and relatively prescriptive. You may get exactly what you want with Google Search, but you will likely lose a lot of the serendipitous features and access to weird, heterogeneous content that made the early web so fun and exciting to explore.
Today, when we talk about “search” we’re usually not thinking about browsing indexes or visiting a webpage. Instead, we’re thinking about scrolling and swiping information from feeds and apps that bring together lots of different content and user profiles into one stream. Or maybe we’re expecting a precise answer to be served up as an extracted snippet of information from an online resource. Most contemporary search features, especially search within platforms like Facebook, Amazon, or the App Store, have monetized the process even further by collecting more and more user data, to the point where tracking user behavior like search terms and browsing habits is nearly always required for people to make use of these increasingly essential services. When we ask ourselves what we’ve lost in considering these earlier search engines, we should try and imagine all the possibilities we’ve foreclosed by granting a monopoly on searching all the world’s online, digital information to one firm like Google, and then ask ourselves: how else can I surf the web?
Distinguished Research Professor, Information Studies, University of California Los Angeles, and the author of Big Data, Little Data, No Data: Scholarship in the Networked World
In the ’90s, Yahoo and AltaVista did pretty well. But computerized information retrieval is a very old field, dating back at least to the 1950s. The first commercial online remote access systems date back to the early 1970s.
Google did not invent information-retrieval by any means—it built on very old methods of documentation, such as those of Paul Otlet, who invented the Universal Decimal Classification in the 1930s, and was among the parents of modern information science.
The history of online information-retrieval is discipline-specific—very deep specialist indexing in the fields of medicine, metallurgy, materials science, chemistry, engineering, education, the social sciences. We had very good databases online by the early 1970s that were commercially available—you paid by the connect minute.
Some of Google’s most basic principles come out of tf-idf, or Term Frequency times Inverse Document Frequency, a notion that came out of a Cambridge doctoral dissertation in 1958 by Karen Spärck Jones. Her method involved looking at the frequency of a term in a document and weighting that by the inverse of how often the term occurs across the whole body of documents. She’s really a pioneer, and would later consult for Google, along with many other notable information scholars. Page and Brin were definitely deeply schooled in this history.
Google came out of the Digital Libraries Initiative, a project led by the National Science Foundation and involving 8 or 10 different federal agencies. I had funding from it, and recall the all-hands meeting, at which Brin and Page had a poster proposing Google. I remember thinking: this is really cool, they’ve reinvented bibliometrics for the Web.
Bibliometrics is a means to create links between documents and then follow the network. This method is especially useful to pursue topics where terminology changes over time. For example, if you wanted to find what preceded modern abortion discussions, you’d go to a Roe v. Wade discussion from the mid-1970s and look for everything it cited and everything that cited it, so you can go in both directions.
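As a rough illustration of following citations in both directions, here is a minimal Python sketch. The document labels and the `build_citation_index` helper are hypothetical, invented only for this example; they do not describe any real citation index.

```python
from collections import defaultdict


def build_citation_index(citations):
    """Build forward and backward lookups from (citing, cited) pairs."""
    cites = defaultdict(set)     # document -> documents it cites
    cited_by = defaultdict(set)  # document -> documents that cite it
    for citing, cited in citations:
        cites[citing].add(cited)
        cited_by[cited].add(citing)
    return cites, cited_by


# Hypothetical citation pairs: (citing document, cited document).
citations = [
    ("doc_1975", "doc_1973"),
    ("doc_1975", "doc_1968"),
    ("doc_1992", "doc_1975"),
    ("doc_2005", "doc_1975"),
]

cites, cited_by = build_citation_index(citations)

# Going backward in time: everything the starting document cited.
earlier = sorted(cites["doc_1975"])
# Going forward in time: everything that later cited the starting document.
later = sorted(cited_by["doc_1975"])

print("cited by doc_1975:", earlier)
print("citing doc_1975:", later)
```

Starting from one document and repeating the lookup on each result traces the network outward, even when the vocabulary used in the documents has changed.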
The Science Citation Index, also begun in the 1950s, brought old principles of library science to modern technology. Bibliometrics and citation indexing are ideas that may be traced back centuries to developments like biblical annotation.
Associate Professor of Information Studies and Co-Director of the UCLA Center for Critical Internet Inquiry at UCLA, and the author of Algorithms of Oppression: How Search Engines Reinforce Racism
One of the most important dimensions of early internet information sharing was that subject matter experts, from librarians to scholars to expert hobbyists, were harnessed to cultivate and organize knowledge. What this did was make the humans involved in these practices visible, even as AI and search tools were developed. We understood that people power is what made sharing happen online, and we sought to figure out what was credible based on pockets of websites managed by organizations, especially by universities and research organizations.
The first search engines were, in fact, virtual libraries, and many people understood the value of libraries as a public good. As automation increased, and librarians and experts were replaced with AI, we lost a lot. The public good that could have been realized was replaced by massive advertising platforms, like Yahoo! and Google.
Now, expertise is outsourced and often optimized content, paid for by the highest bidder in AdWords. This has led to a big gap between knowledge and advertising in search engines, especially when trying to understand complex issues. In some ways, search has undermined our trust in expertise and critical thinking, backed by investigated facts and research, and left us open to manipulation by propaganda. Search engines may be great in helping us find banal information, but they have also desensitized us to the value of slow, deliberate investigation—the kind that makes for a more informed democracy.
Associate Professor, History, University of Waterloo, and the author of History in the Age of Abundance: How the Web is Transforming Historical Research
Google was, of course, not the first search engine for the web. Dating back to 1993, there was the Wandex (or World Wide Web Wanderer), which measured the web and led to a searchable index. Lycos and Infoseek followed in 1994, and directories like Yahoo! in 1995.
A lot of these early search engines or directories, however, were fairly clunky. If you were a website creator, you would in many cases have to fill out a form to be added to the directory, or would need to insert fairly cumbersome meta tags into your HTML. By the mid-1990s, as more and more people began to create websites and host them on third-party platforms, they did not always register their sites.
Part of this is because early websites could rely on hyperlinks–far more so than we do today, in our age of search–to bring visitors to their sites.
The WebRing is a great example of this. The WebRing was developed in 1995 by a young software developer named Sage Weil. WebRings were groups of websites that were topically unified. So, people interested in old cars would join an automobile enthusiast WebRing, cat lovers a cat-focused WebRing, and so forth. On the bottom of these pages would be a WebRing interface, encouraging users to go to the “next” site or the “previous” site, or even to an overall index of everybody who had joined the ring.
This was a pretty democratic and accessible method for discovering sites. Anybody could start a web ring, anybody could join one if the administrator thought they fit into the community. Crucially, they formed a new way to connect people. The heyday of WebRings lasted until around 2000, when the technology ended up in the hands of Yahoo! and some management changes ended up alienating users.
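The “next” and “previous” navigation described above behaves like a circular list. The Python sketch below is only an illustration under that assumption; the `WebRing` class and the site names are hypothetical and are not taken from the original WebRing software.

```python
class WebRing:
    """A topically unified ring of sites with wrap-around navigation."""

    def __init__(self, sites):
        self.sites = list(sites)

    def next_site(self, current):
        # Step forward, wrapping from the last member back to the first.
        i = self.sites.index(current)
        return self.sites[(i + 1) % len(self.sites)]

    def previous_site(self, current):
        # Step backward, wrapping from the first member to the last.
        i = self.sites.index(current)
        return self.sites[(i - 1) % len(self.sites)]

    def index(self):
        # The overall listing of everybody who has joined the ring.
        return list(self.sites)


# Hypothetical member sites of a cat-focused ring.
ring = WebRing(["cats-a.example", "cats-b.example", "cats-c.example"])
print(ring.next_site("cats-c.example"))
print(ring.previous_site("cats-a.example"))
```

Because the list wraps around, a visitor who keeps clicking “next” eventually passes through every member and returns to the starting page.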
I don’t want to be unduly nostalgic: I wouldn’t want to go back to a world where we discovered content mostly through hyperlinks, and I use Google as much as anybody else. But the way that Google works, thanks to PageRank, is that the more links that a site has coming into it from influential venues, the higher up in the search results pages it goes. This has the effect of funneling traffic to a few big winners. If I search for “cats,” I might explore the top dozen or so of almost four billion results. Somewhere in those billions of pages there are undoubtedly cool homepages by people who just really love their cats. In 1998, clicking through a WebRing, there was a chance I would have serendipitously discovered some fascinating content, or begun to feel some community through finding like-minded people. That’s harder to find with Google.
Associate Professor of the Practice in Media Arts and Sciences at MIT Media Lab, Director of the Center for Civic Media at MIT, and the author of Digital Cosmopolitans: Why We Think the Internet Connects Us, Why It Doesn’t, and How to Rewire It
Well, in those dark days, we used several different search engines, which ran on two different philosophies: TF-IDF and human curation.
TF-IDF stands for “Term Frequency Inverse Document Frequency.” What that means is that a search engine takes your query (“mule power”) and looks for documents that contain the terms. But it also considers how common each term is across the corpus as a whole, to avoid overmatching on very common terms. So in searching for “mule power”, a TF-IDF engine is likely to prefer documents that mention mules over those that mention power, because power is a more common word than mule.
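To make the weighting concrete, here is a minimal Python sketch of one common TF-IDF variant (term frequency times the log of inverse document frequency). The four-document corpus and the `tf_idf_scores` function are made up for illustration, and the exact formulas the 1990s engines used are not given here, so treat this as an assumption-laden toy.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document against the query with a simple TF-IDF sum."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in tokenized if term in other)
            if df == 0:
                continue  # term appears nowhere, so it contributes nothing
            tf = counts[term] / len(tokens)   # how prominent the term is here
            idf = math.log(n_docs / df)       # rarer terms get a larger weight
            score += tf * idf
        scores.append(score)
    return scores


# Hypothetical corpus: "power" is common, "mule" is rare.
docs = [
    "mule racing results and mule news",
    "power tools and power supplies",
    "solar power for the home",
    "wind power and water power",
]
scores = tf_idf_scores("mule power", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

In this toy corpus the document about mules scores highest, even though other documents repeat “power” just as often, because “power” appears in three of the four documents and so carries little weight.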
TF-IDF is vulnerable to a very specific sort of hacking. If I want to sell you my new mule-powered web browser (they were all the rage in the early 1990s...), I just post a web page that says “mule power” over and over. There’s no document on the web that’s a better match than that for your query, so I’ll come up #1 every time. That’s the weakness that led Larry Page and Sergey Brin to work on PageRank. The idea was that pages like my spam page would be unlikely to be linked to, whereas helpful pages would have lots of incoming links. Google basically married TF-IDF to PageRank to launch their initial search engine. (People figured out how to game PageRank as well, creating farms of webpages that all said “mule power” and linked to one another. Google created more complex algorithms in response. Progress. People stopped using mule-powered browsers and the steam browser became the new hotness. Literally: you could burn yourself really badly on one if you weren’t careful.)
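The link-counting idea can be sketched with a small power-iteration version of PageRank in Python. The four-page link graph, the page names, and the damping factor of 0.85 are assumptions chosen for illustration; this is a textbook-style simplification, not Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its current rank among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical web: the spam page links out, but nothing links to it.
links = {
    "mule-federation": ["mule-news"],
    "mule-news": ["mule-federation"],
    "mule-fans": ["mule-federation", "mule-news"],
    "spam-page": ["mule-federation"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{score:.3f}  {page}")
```

The spam page can repeat “mule power” as often as it likes, but with no incoming links it never rises above the baseline share. A link farm defeats this toy version too, which matches the gaming described above.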
Lycos, which I briefly worked for after they bought Tripod, the company I helped launch, ran on TF-IDF, as did Excite, HotWired and AltaVista, which I remember as being the best of the bunch.
TF-IDF never worked especially well. As time went on, smart search engines discovered that 30%-50% of queries could be solved with hand-curated search pages. For instance, if you searched “mule race results,” finding you a page that prominently mentioned that phrase was probably not helpful; sending you to the front page of the AMF (the American Muleracing Federation) would be a better result. Lycos served at least 30% hand-crafted results pages when I left in 1999.
Yahoo, by contrast, initially ran on a completely human curated basis. It wasn’t a search engine, but a directory. When you searched for “mule racing”, it would show you where mule racing fit in various hierarchies:
Sports -> Sports Leagues -> Racing -> Mule Racing
and then link to AMF, OOM (Only Ornery Mules) and ESPN (Entertainment and Livestock Programming Network)
Law -> Animal Abuse -> Mule Racing
and then to PET’eM (People for the Ethical Treatment of Mules)
What was great about this is that it could show you how one entity (AMF) fit into the larger world of mule-racing. It was particularly terrific if you were researching companies, as you could quickly find potential competitors or different suppliers. But it was a royal pain in the ass to build, requiring actual human taxonomists to look at sites and figure out where they landed in the hierarchy. And god help you when someone invented something new, like the steam-powered racing mule. Does that go under mule racing, or steam power? Both? Or a new category entirely to recognize the advent of new sporting leagues like NASCAR (National Active Steam Cattle Associated Racing)?
Yahoo! worked really well for the first few years of the web, but it was unwieldy and breaking down by 1997 or so—they began outsourcing their search to other companies (Excite at first... Bing now.) I do miss it, if only because it was fascinating to see the ways people had chosen to organize the whole of human knowledge. (Melvil Dewey assigned the 200s to “religion” and then dedicated 220-280 to various different topics about the Bible. The 290s are about “other religions”... including Buddhism, Hinduism, etc.)
It’s hard to imagine Yahoo coming back—it’s just too much damned work. In a sense, human-curated search pages have made something of a comeback. Much of the Google results page is not a TF-IDF type of web search but a page constructed out of various database queries - search for the weather, and Google uses geolocation to determine where you are and finds local weather news from a db. I actually think pages curated by humans - librarians working together Wikipedia style, for instance - might be a great solution to how to handle rapidly emerging topics that tend to be hijacked by political extremists or disinfo merchants.
As for what I miss: I miss the mules. My mule-powered Netscape browser was slow, but I miss those gentle rhythms of grazing the web.