In April 2018, Mark Zuckerberg testified before the Senate Judiciary and Commerce Committees regarding data and privacy issues. His testimony included responses to questions like, “How do you sustain a business model in which users don’t pay for your service?” Zuckerberg answered, “Senator, we run ads.”
This isn’t the first time — nor will it be the last — that a member of Congress or a regulator has demonstrated a lack of understanding of the technology being considered for regulation. And as demonstrated by the recent Senate hearing, AI presents even greater challenges for legislators and regulators.
The New York Times recently published an article highlighting how the federal government and regulators have been “hands-off” about regulating artificial intelligence so far. It will be imperative that lawmakers and regulators understand Generative AI to properly regulate it.
The same logic will also apply to attorneys. They must have a foundational understanding of how AI works to be able to effectively advise clients on matters pertaining to AI. Here, we’ll examine the evolution of search functions to better understand how we got to the AI that’s available today.
This is the first in a series of articles that will help lay a foundation so attorneys can better understand how AI works.
Our journey to ChatGPT actually begins with Gutenberg. Around 1440, Johannes Gutenberg invented the movable type printing press. The result of his invention expanded the ability to communicate and had profound impacts on culture, religion, politics, power structures, and society. His invention influenced the Renaissance and the Protestant Reformation in Europe. The printing press was a tremendous technological advancement and spurred the development of copyright law.
By the end of the 1500s, published works were complex enough that some publications included indexes of key words and concepts. An index provides the ability to quickly find something important in a publication without reading the entire publication. A reader simply goes to the back of the publication, looks through the listing and is directed to the specific page(s) where the subject matter appears.
From Publication Indexes To Search Engines
Most people think of Google when they think of full-text searching, but the first commercial applications date back to the 1960s. The Dialog service, developed by Lockheed in 1966, was commonly used in law firms in the 1980s and 1990s, and Mead Paper developed a search engine in the same timeframe, launching Mead Data Central in 1973. Searching legal documents was one of the first commercial applications of the technology.
The secret of how a search engine works is quite simple. Rather than a manually curated index of important words, computing technology enabled the creation of a comprehensive index of all words in a collection of documents.
While a human referencing an index would be satisfied with simply knowing the page number to seek out in a book, a “search index” is more comprehensive. It inventories the exact location of each word within a document for use by an algorithm. “Noise words” like “a,” “of,” and “the” are excluded.
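The idea can be sketched in a few lines of code. This is a minimal illustration of a positional search index built over a tiny in-memory collection; the document set, the noise-word list, and all names here are illustrative, not any vendor's actual implementation.

```python
from collections import defaultdict

NOISE_WORDS = {"a", "of", "the"}  # excluded from the index

def build_index(docs):
    """Map each word to {doc_id: [word positions]} across the collection."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if word not in NOISE_WORDS:
                index[word][doc_id].append(pos)
    return index

docs = {
    1: "securities fraud is a serious offense",
    2: "the fraud claim was dismissed",
}
index = build_index(docs)
print(sorted(index["fraud"]))  # → [1, 2]  (both documents contain "fraud")
```

Note that the index records not just which documents contain a word, but where in each document it appears — that positional detail is what makes the proximity searches described below possible.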
By creating a search index, Boolean searching was made possible. A user could search “securities” and “fraud” and be presented quickly with all documents in a database that contained the words “securities” and “fraud.”
Additionally, a search algorithm could support other logical search operators (e.g., “not”) to exclude words from a search result. And since the search index knows the exact location of each word in every document, the ability to search “securities” within 10 words of “fraud” is also easy for the search algorithm.
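These operators fall out of the index almost for free. The sketch below shows “and,” “not,” and a within-N proximity search over a positional index; the index layout and example documents are illustrative assumptions, not a real product's code.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def boolean_and(index, w1, w2):
    """Documents containing both words: a set intersection of doc-id lists."""
    return sorted(set(index[w1]) & set(index[w2]))

def boolean_not(index, w1, w2):
    """Documents containing w1 but not w2: a set difference."""
    return sorted(set(index[w1]) - set(index[w2]))

def within(index, w1, w2, distance):
    """Documents where w1 appears within `distance` words of w2."""
    hits = []
    for doc_id in set(index[w1]) & set(index[w2]):
        if any(abs(p1 - p2) <= distance
               for p1 in index[w1][doc_id]
               for p2 in index[w2][doc_id]):
            hits.append(doc_id)
    return sorted(hits)

docs = {
    1: "securities fraud was alleged in the complaint",
    2: "the securities were sold before the fraud occurred",
    3: "the fraud involved forged checks",
}
index = build_index(docs)
print(boolean_and(index, "securities", "fraud"))  # → [1, 2]
print(within(index, "securities", "fraud", 3))    # → [1]
```

Notice that the proximity search never rereads the documents: comparing stored positions is enough, which is why “within 10 words of” is cheap for the algorithm.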
Search engines do not comb through every document. That takes too long, even for a computer. Search engines are efficient because they take the same shortcut that a human does: they access their search index much like a person going to the back of a book to find a word in an index. That is the magic and mystery of full-text searching.
Over time, indexing has become much more sophisticated. For example, phrases and legal terms (e.g., “consequential damages”) would be inventoried and indexed as if they were a single word. This would eliminate search results where the words “consequential” and “damages” were mentioned in a document, but the document had nothing to do with the legal term “consequential damages.”
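One way to implement this is to merge known multi-word terms into single tokens before indexing. The sketch below assumes a small hand-built phrase list; the list and token format are illustrative.

```python
# Known legal phrases to treat as single indexable tokens (illustrative).
PHRASES = {("consequential", "damages"): "consequential_damages"}

def tokenize(text):
    """Split text into words, merging known two-word phrases into one token."""
    words = text.lower().split()
    tokens, i = [], 0
    while i < len(words):
        if tuple(words[i:i + 2]) in PHRASES:
            tokens.append(PHRASES[tuple(words[i:i + 2])])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

print(tokenize("the consequential damages clause"))
# → ['the', 'consequential_damages', 'clause']
```

A document that mentions “damages” in one paragraph and “consequential” in another never produces the merged token, so it no longer matches a search for the legal term.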
Another example of how search indexing has improved over the years relates to a thesaurus. A human might reference a thesaurus to find another word with a similar meaning. A search algorithm can do the same thing. For example, in Louisiana the word “parish” is equivalent to the word “county” in other parts of the country. A search index that incorporates a thesaurus of terms can ensure that a user searching for “counties” will receive a search result that includes “parishes” in Louisiana.
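In code, this kind of thesaurus support can be as simple as expanding the user's query terms before the index lookup. The synonym table below is a tiny illustrative assumption, not a real product's thesaurus.

```python
# Hand-built synonym table (illustrative).
THESAURUS = {
    "county": {"parish"},   # Louisiana usage
    "parish": {"county"},
}

def expand_query(terms):
    """Return each search term plus any synonyms from the thesaurus."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        expanded |= THESAURUS.get(term, set())
    return expanded

print(expand_query(["county", "tax"]))  # includes "parish" as well
```

The search then runs against the expanded term set, so Louisiana documents that only say “parish” still appear in the results.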
Boolean search was popular well into the 1990s. It was precise and gave very accurate and repeatable results. Some expert searchers still prefer Boolean searching.
Natural Language Search
As computing power increased, more resource-intensive approaches to searching became possible. Systems were developed to allow users to enter phrases in “plain English” rather than using complex Boolean connectors. This was an early application of natural language processing, and the legal industry was an early adopter of it.
In Boolean search, it was common for a search result to list documents in chronological order, starting with the most recent. A user might sift through hundreds of documents to find the most relevant one. But with plain English searching, results are presented in order from most relevant to least relevant.
Entering a sentence like “Show me cases where compensatory damages were denied in a car accident” is an example of a plain English search. The search algorithm will identify all documents that have any of the words in the search query.
The secret behind early plain English searching lay in tweaks to the indexing and the algorithm. The search algorithm would rank documents based upon two factors. First, how frequently do the query words occur in an individual document? The more frequent, the higher the ranking. Second, how unique is a word across the overall database? Words like “compensatory” might be infrequent in the database, so documents containing “compensatory” would get pushed toward the top of the search results.
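The two factors can be combined into a single score, which is essentially a simplified TF-IDF weighting. The sketch below is illustrative: the document collection and the exact scoring formula are assumptions, not any vendor's ranking algorithm.

```python
import math
from collections import Counter

docs = {
    1: "compensatory damages were denied in the accident case",
    2: "the accident caused damages to the car and the driver",
    3: "the court awarded compensatory damages after the accident",
}

def rank(query, docs):
    """Rank doc ids by term frequency weighted by term rarity (simplified TF-IDF)."""
    n = len(docs)
    words = {d: Counter(t.lower().split()) for d, t in docs.items()}
    scores = Counter()
    for term in query.lower().split():
        df = sum(1 for counts in words.values() if term in counts)
        if df == 0:
            continue  # term appears nowhere in the database
        rarity = math.log(n / df) + 1.0   # rarer terms weigh more
        for d, counts in words.items():
            scores[d] += counts[term] * rarity  # frequent + rare → higher score
    return [d for d, _ in scores.most_common()]

print(rank("compensatory damages denied", docs))  # → [1, 3, 2]
```

Document 1 wins because it contains the rare word “denied”; document 2 trails because it matches only the common word “damages,” which carries little weight.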
Google’s search algorithm applies natural language “plain English” search at massive scale. But Google makes its money from paid advertising, much like Zuckerberg’s Facebook, so its algorithm and search results are shaped to support making money. Google adjusts its search algorithm regularly and prioritizes results based on many factors, including the quality of sites, user location, and the filtering of objectionable content.
Full-text searching sounds complex, but it’s understandable when the core concepts are related to metaphors like a book’s index. Additionally, attorneys have been using an early version of natural language processing, “plain English” search, for the better part of their careers.
In ChatGPT, a user’s “plain English” query results in a conversational answer in “plain English.”
Next month, we’ll explore how natural language processing that powers searching relates to ChatGPT.
Ken Crutchfield is Vice President and General Manager of Legal Markets at Wolters Kluwer Legal & Regulatory U.S., a leading provider of information, business intelligence, regulatory and legal workflow solutions. Ken has more than three decades of experience as a leader in information and software solutions across industries. He can be reached at firstname.lastname@example.org.