[Editor’s note: Recently, I had a conversation with Josh Becker, chairman of Lex Machina and head of legal analytics at LexisNexis, about the fast-growing area of litigation analytics. Becker made the case that not all analytics products on the market are created equal. Of course, he is biased in favor of his own product, but his position is worth airing and considering. So I invited him to put his thoughts in a post. The following is Becker’s article. If you have a different view, I’m happy to consider publishing it. Reach out to me at ambrogi-at-gmail-dot-com or on Twitter at @bobambrogi.]

By Josh Becker

Applying legal analytics to large stores of litigation data can be transformative for lawyers, law firms and their clients, leading to new insights, smarter legal strategies, stronger legal arguments and favorable outcomes. Practitioners who make judicious use of quality analytics tools – and recognize their limitations – can do better legal work more efficiently and enjoy a competitive advantage over those who do not have such tools at their disposal. Analytics can also provide organizations with data-based insights into the business of law, informing decisions that range from whom to hire and how to pitch prospective clients to where to make business development investments.

However, some legal analytics solutions can produce inaccurate or misleading information, sometimes sending lawyers down the wrong path altogether. Access to a solution with a “legal analytics” label is not a guarantee of good, or even satisfactory, results.

Whether your encounter with legal analytics fulfills the enormous promise of the technology depends – a lot – on the specific tools you use, the size of the database and the quality of the data you are drawing from. It also depends on the level of human domain expertise that your analytics vendor brings to its data mining activities. Machine learning and natural language processing are powerful technologies, but they still need assistance from human beings with practice-specific experience and a deep understanding of the nuances of specialized legal language in order to make sense of the data and be practical for demanding legal use cases.

If you are a lawyer litigating a case, an effective legal analytics tool will allow you to quickly identify and compare cases that are similar to the one you are working on. In very little time, you can determine which tactics and strategies have historically worked in such cases and which have not. Analytics helps you make rational decisions and accurate predictions about the case currently at hand based on those historical outcomes. On the other hand, if you are comparing your case against an incomplete set of cases or the wrong set of cases, or if the historical cases you believe are relevant have missing or erroneous data, then the “insights” gleaned from that information could actually hinder or even harm your case.

In order to be an effective tool, legal analytics requires that you have a very large, comprehensive body of data to work with. It also requires that the data be clean and accurate. That means it must be properly coded, tagged, enhanced and structured so that users can quickly find and grasp the essential details of the most pertinent cases without having to wade through irrelevant cases and useless information.

PACER: The Opportunity and the Challenge

Many legal analytics solutions focus on federal litigation data from PACER, and with good reason. The data goes back 20 years, the corpus is huge, and experts estimate it is growing by 2 million cases and tens of millions of documents per year. But PACER data also lacks uniformity and contains “noise,” and if these issues aren’t addressed, analytics tools mining the data are very likely to produce misleading information for practitioners. Even the best AI technology cannot prevent that.

Like any massive database, PACER data contains a number of misspellings and other misleading information, which often reflect errors in court documents themselves. As a result, many analytics tools will overlook critical cases handled by specific attorneys and firms whose information is wrong in the database, and the users of those tools will never know it. For one very prominent global law firm, there are more than 100 spelling variations of its name in PACER. If the analytics solution you are using has not been trained to recognize and correct such inconsistencies, you could end up making key decisions based on false or incomplete information.
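To make the name-variation problem concrete, here is a minimal sketch of the kind of fuzzy matching a tool might use to map misspelled firm names back to a canonical form. The firm name and variants below are entirely hypothetical, and real products would combine string similarity with human-reviewed alias tables rather than relying on a single threshold:

```python
from difflib import SequenceMatcher

def normalize_firm_name(raw_name, canonical_names, threshold=0.8):
    """Map a possibly misspelled firm name from a docket to its
    canonical form using simple string similarity. If no canonical
    name is similar enough, the raw name is returned unchanged."""
    best_match, best_score = None, 0.0
    cleaned = raw_name.lower().strip()
    for canonical in canonical_names:
        score = SequenceMatcher(None, cleaned, canonical.lower()).ratio()
        if score > best_score:
            best_match, best_score = canonical, score
    return best_match if best_score >= threshold else raw_name

# Hypothetical firm name and one of the misspellings that might
# appear in court filings.
canonical = ["Example & Partners LLP"]
normalize_firm_name("Exmple & Partners LLP", canonical)
```

A production system would tune the threshold carefully, since merging two genuinely different firms is at least as damaging as failing to merge two variants of one firm.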

Inconsistencies in PACER data about attorneys and firms are especially problematic when you are looking at historical information. For example, when attorneys change firms, PACER attributes their past cases to their current firm, which can create serious misperceptions of the firm’s expertise. PACER also lacks a mechanism for identifying lawyers who are working on a case pro hac vice. Again, this can cause users of some analytics tools to make assumptions that are not supported by the facts. Analyzing the strengths and weaknesses of opponents is a key use case for legal analytics, but if you are drawing conclusions and developing case strategy on incorrect information, you could be headed for trouble and adverse outcomes.

Nature of Suit (NOS) codes are another problem area with PACER data. NOS codes are sometimes applied in ways that can be misleading to practitioners. Tools that rely on these codes to classify and filter cases can distort search results by omitting cases from results or by wrongly including other cases. Perhaps the best example of this issue is the landmark copyright case Oracle America, Inc. v. Google Inc., which is classified in PACER as a patent case. If you are using an analytics solution that is unable to recognize the case’s relevance to copyright, the case would simply not appear in searches for copyright cases – a glaring omission.
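The fix the article implies is classifying on case content rather than on the NOS code alone. A purely illustrative sketch of that idea follows; the code table is limited to three real NOS codes (830 patent, 820 copyright, 110 insurance), and the keyword matching is far cruder than what any serious product would use:

```python
def classify_case(nos_code, full_text):
    """Assign practice-area labels using both the filed NOS code and
    signals from the case documents themselves. Illustrative only."""
    NOS_LABELS = {830: "patent", 820: "copyright", 110: "insurance"}
    labels = set()
    if nos_code in NOS_LABELS:
        labels.add(NOS_LABELS[nos_code])
    # Supplement with full-text signals, so that a copyright dispute
    # filed under a patent NOS code still surfaces in copyright searches.
    text = full_text.lower()
    if "copyright infringement" in text:
        labels.add("copyright")
    if "trade secret" in text:
        labels.add("trade secret")
    return labels
```

Under this scheme, a complaint filed under NOS code 830 that alleges copyright infringement would carry both the patent and copyright labels, so a copyright search would still find it.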

PACER also doesn’t have codes for certain important practice areas, like commercial law and trade secrets. These cases end up filed under other NOS codes, and there appears to be no consistent standard for doing so. If you are using a tool that relies on NOS codes for classification, your search for all commercial cases presided over by Judge Otero in the Central District of California will not find every case, and may well deliver misleading results based on an unrepresentative body of data.

Many analytics tools classify cases based on case dockets and NOS codes – not full case content – and this is likely to lead to misperceptions, misguided legal strategies, increased risk and poor outcomes. PACER is an extraordinary body of data, but it has many known inaccuracies. The best tools will be equipped to recognize and correct them.

The Importance of Practice-Specific Filters and Data Tags

While most analytics tools offer basic filters for variables like district, judge, firm, case type and time period, they typically lack practice-specific case tags. This is a major shortcoming.

Let’s say you’re an employment lawyer looking for cases concerning Title VII discrimination. Unless your analytics tool tags such cases, you will likely have to pore over results for tens of thousands of employment cases to find what you’re looking for, and you are likely to make mistakes and oversights in the process. Does the tool you’re using allow you to exclude “hurricanes” from insurance cases? If not, you might arrive at the wrong conclusion. PACER data shows that the Eastern District of Louisiana has seen over 7,500 insurance cases (8% of all cases), more than any other district. If you conclude from that statistic that it is an insurance-friendly district and seek a transfer to that venue, you would be mistaken. Excluding hurricane-related cases reduces the total to about 1,100 cases, or 1% of all cases.
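The hurricane example comes down to tag-based filtering. A toy sketch of the mechanics, using made-up records (the real analysis would run over full PACER-derived case data, with tags applied after human review of the documents):

```python
# Hypothetical case records; field names and values are invented
# for illustration.
cases = [
    {"district": "E.D. La.", "practice_area": "insurance", "tags": ["hurricane"]},
    {"district": "E.D. La.", "practice_area": "insurance", "tags": []},
    {"district": "E.D. La.", "practice_area": "contracts", "tags": []},
]

def count_cases(cases, district, practice_area, exclude_tags=()):
    """Count cases in a district and practice area, optionally
    excluding any case that carries one of the given tags."""
    return sum(
        1 for c in cases
        if c["district"] == district
        and c["practice_area"] == practice_area
        and not (set(c["tags"]) & set(exclude_tags))
    )
```

With real data, running this count with and without `exclude_tags=("hurricane",)` is exactly the difference between the 7,500-case figure and the roughly 1,100-case figure above.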

AI technologies can’t detect these kinds of nuances and anomalies on their own. They need to be continually “trained” by knowledgeable lawyers to sort through the data and apply practice-specific filters and data tags. Human intervention is also necessary to teach machines to capture critical information about case timing, damages, findings, remedies and other factors that are often essential to developing case strategy. PACER headers do not capture such information, and tools that rely on those headers will not be able to find it.

If you are considering buying a legal analytics solution, don’t rely on marketing brochures and promises. Every solution should be tested thoroughly, and sales reps should be quizzed in detail about how their organization cleans, structures and tags the data, and the extent to which those processes are informed by the practice-specific insight of human experts. Be sure to develop challenging, detailed research questions across a broad range of practice areas and use cases, and then compare the results you get from various tools. PACER is a very large litigation database that provides very broad coverage, but analytics tools that fail to penetrate deeply into the data may produce misleading information and actually increase your exposure to risk.