Unified Patents' Methodology for OPAL GenAI Landscape
Summary
Unified Patents' Objective PAtent Landscape ("OPAL") for Generative Artificial Intelligence (GenAI) objectively assesses and predicts the relevance of patent publications to core GenAI technologies and applications. It was created using a machine learning algorithm trained on GenAI training data provided to Unified Patents by EconSight. A brief summary of the methodology follows:
- Universe of Patents Subject to Analysis: Approximately 2.1 million patent publications worldwide potentially relevant to GenAI.
- GenAI Training Set Developed by Experts: The training set used for this landscape was developed and provided to us by the experts at EconSight. EconSight was previously commissioned by WIPO to conduct a study on GenAI technology and patents.
- AI Training: Patents were vectorized using FastText and CPC embeddings, and a binary classification algorithm was trained to predict relevance to the GenAI domain.
- ML Performance: Training adhered to recognized machine learning best practices and achieved an F-1 score of 0.98, indicating high precision and recall.
Basis for Using Machine Learning to Predict GenAI Relevance
The explosive growth and broad applicability of GenAI technologies have resulted in a vast and continually expanding patent landscape. Accurately identifying patents truly relevant to core GenAI concepts is crucial for patent licensing, portfolio management, and competitive intelligence. Given the impracticality and high cost of manual expert review at scale, Unified Patents has applied its established OPAL approach—characterized by objective, transparent, cost-efficient, and reliable machine learning analytics—to precisely identify patents relevant to GenAI.
Creating a Training Set
The training set used by Unified Patents was provided to us by EconSight. It is a refinement of the original training set that EconSight developed for a study commissioned by WIPO. A detailed description of the original training set can be found in the WIPO report and is summarized in the following two paragraphs.
EconSight employed a robust, two-stage classification process. The initial stage aimed at achieving high recall by integrating traditional patent searches—utilizing generic terms such as "generative AI" and specific technical concepts including Generative Adversarial Networks (GANs), Autoregressive Models, Diffusion Models, Large Language Models (LLMs), and Variational Autoencoders—with advanced AI-driven semantic retrieval techniques.
In the second stage, EconSight refined this initial dataset through machine learning classification, utilizing a BERT (Bidirectional Encoder Representations from Transformers) model fine-tuned with labeled patent data. Positive seed patents, which clearly represented GenAI, and negative seed patents, closely related AI patents explicitly excluded from GenAI, were used to train the model, enhancing its ability to discriminate between relevant and non-relevant patents. Sample-based evaluations indicated strong model performance, with a precision of 0.8, recall of 0.9, and an F1 score of 0.85.
The updated training set provided by EconSight differs from the original WIPO dataset in several significant ways. First, it is based on a more recent patent corpus. Additionally, the updated approach leverages advanced large language models (LLMs) to achieve more precise classification. Specifically, EconSight created semantic concepts related to generative AI to perform targeted semantic searches. These candidate patents were then analyzed in detail—examining their titles, abstracts, descriptions, and claims—using two distinct LLMs. A definitive label was assigned only when both LLMs agreed on the classification, ensuring high accuracy. Patents with model disagreement were excluded from the labeled dataset, and a further quality control step involved manual review of approximately 2% of the patents.
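The agreement rule described above can be sketched in a few lines. This is only an illustration, not EconSight's implementation: the function names, the label strings, and the publication IDs are hypothetical, and the LLM outputs are assumed to be already available as one label per model.

```python
import random
from typing import Dict, List, Optional, Tuple


def consensus_label(label_a: str, label_b: str) -> Optional[str]:
    """Return the shared label when both models agree, otherwise None."""
    return label_a if label_a == label_b else None


def build_labeled_set(candidates: Dict[str, Tuple[str, str]]) -> Dict[str, str]:
    """Keep only candidates on which the two models agree."""
    labeled = {}
    for pub_id, (label_a, label_b) in candidates.items():
        label = consensus_label(label_a, label_b)
        if label is not None:  # disagreements are excluded, not guessed
            labeled[pub_id] = label
    return labeled


def sample_for_review(labeled: Dict[str, str], fraction: float = 0.02,
                      seed: int = 0) -> List[str]:
    """Draw a reproducible sample of labeled IDs for manual quality control."""
    ids = sorted(labeled)
    k = max(1, round(len(ids) * fraction)) if ids else 0
    return random.Random(seed).sample(ids, k)


# Hypothetical candidates: publication ID -> (label from LLM 1, label from LLM 2)
candidates = {
    "PUB-001": ("genai", "genai"),
    "PUB-002": ("genai", "not_genai"),  # disagreement: dropped
    "PUB-003": ("not_genai", "not_genai"),
}
labeled = build_labeled_set(candidates)
review = sample_for_review(labeled)
```

Dropping disagreements trades training-set size for label quality, which matches the stated goal of assigning a label only when both models concur.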
Moreover, EconSight emphasized the identification and inclusion of "hard negatives," patents conceptually similar to generative AI but ultimately falling outside the defined scope. This enhancement, made possible by the improved capabilities of recent LLMs, allowed for clearer distinctions between foundational generative AI inventions and closely related technologies, such as classical AI methods, simulation-based generation, and domain-specific generative applications (e.g., 3D printing, game design).
Finally, EconSight's updated training set broadly encompasses foundational GenAI technologies (core models) as well as applications spanning various domains, including text and image generation, gaming, robotics, marketing, code generation, and data synthesis, effectively capturing the evolving landscape of generative AI innovations. For non-English patents, EconSight relied directly on standard English translations provided by IFI Claims, ensuring dataset consistency.
Predicting Relevance via Machine Learning
With EconSight's training set, Unified Patents employed positive and negative labels to train a binary classification algorithm specifically targeting GenAI relevance. Unified Patents followed textbook machine learning protocols to ensure rigor and replicability.
Unified Patents vectorized the title, abstract, claims, and CPC codes of each patent in the training set using FastText and CPC embeddings. Dimensionality reduction was then applied to mitigate overfitting. An ensemble of high-performing algorithms, including XGBoost and shallow extra-randomized forests, was aggregated to form the final predictive model.
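As a rough illustration of how text and CPC vectors might be combined and how several models' outputs might be aggregated, consider the sketch below. The source does not say how the CPC embeddings are pooled or how the ensemble is aggregated, so mean pooling, concatenation, and probability averaging are assumptions here, and all names and numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def combine_features(text_vec: Sequence[float],
                     cpc_vecs: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate a text embedding with the mean of the CPC code embeddings.

    Assumption: mean pooling over CPC codes; the source does not specify this.
    """
    dim = len(cpc_vecs[0])
    cpc_mean = [sum(vec[i] for vec in cpc_vecs) / len(cpc_vecs) for i in range(dim)]
    return list(text_vec) + cpc_mean


def soft_vote(probabilities: Sequence[float],
              threshold: float = 0.5) -> Tuple[float, bool]:
    """Average per-model relevance probabilities and apply a decision threshold.

    Assumption: simple averaging; the actual aggregation rule is not described.
    """
    average = sum(probabilities) / len(probabilities)
    return average, average >= threshold


# Toy two-dimensional vectors standing in for FastText and CPC embeddings.
features = combine_features([0.1, 0.3], [[1.0, 0.0], [0.0, 1.0]])

# Hypothetical probabilities from two ensemble members for one patent.
score, is_relevant = soft_vote([0.9, 0.7])
```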
A stratified K-fold cross-validation approach was deployed during training to correct potential optimistic biases due to imbalanced datasets, ensuring the proportional representation of positive and negative labels in both training and test subsets.
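A minimal sketch of stratified fold assignment is shown below, using only the standard library. It is not Unified Patents' code: the function name, the value of k, and the 20/80 label split are illustrative assumptions.

```python
import random
from collections import defaultdict
from typing import Iterator, List, Sequence, Tuple


def stratified_kfold(labels: Sequence[int], k: int = 5,
                     seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) with class ratios preserved per fold."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Shuffle within each class, then deal indices round-robin into k folds
    # so every fold receives a proportional share of each class.
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(index for j, fold in enumerate(folds) if j != i
                       for index in fold)
        yield train, test


# Imbalanced toy labels: 20 positives and 80 negatives.
labels = [1] * 20 + [0] * 80
splits = list(stratified_kfold(labels, k=5))
```

With these toy labels, every test fold holds 4 positives and 16 negatives, the same 20/80 ratio as the full set. In practice, scikit-learn's `StratifiedKFold` provides the same behavior.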
Applying the Trained Model to the Universe
With the GenAI-specific training set in hand, Unified Patents defined a comprehensive universe of relevant and related patents. This effort consisted of an initial step that combined the patents from the training set with other patents captured using the same keyword queries and CPC codes that EconSight used in the aforementioned WIPO report, detailed in Appendix A of that report.
This initial collection of patents was expanded in three steps: first by extended family, then by both forward and backward citations, and finally by extended family again. The final universe consisted of 2,152,079 relevant and related GenAI patents.
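The family, citation, and family expansion sequence can be sketched as set operations. The data structures below (`family_of`, `members_of`, `cites`) and the publication IDs are hypothetical stand-ins for whatever patent data source is actually used.

```python
from typing import Dict, Iterable, Set


def expand_by_family(pubs: Iterable[str], family_of: Dict[str, str],
                     members_of: Dict[str, Set[str]]) -> Set[str]:
    """Add every member of each publication's extended family."""
    pubs = set(pubs)
    expanded = set(pubs)
    for pub in pubs:
        family = family_of.get(pub)
        if family is not None:
            expanded |= members_of.get(family, set())
    return expanded


def expand_by_citations(pubs: Iterable[str],
                        cites: Dict[str, Set[str]]) -> Set[str]:
    """Add backward citations (cited by pubs) and forward citations (citing pubs)."""
    pubs = set(pubs)
    expanded = set(pubs)
    for source, targets in cites.items():
        if source in pubs:
            expanded |= targets   # backward: what our publications cite
        if targets & pubs:
            expanded.add(source)  # forward: what cites our publications
    return expanded


def build_universe(seed: Iterable[str], family_of: Dict[str, str],
                   members_of: Dict[str, Set[str]],
                   cites: Dict[str, Set[str]]) -> Set[str]:
    """Apply family expansion, then citation expansion, then family expansion."""
    step1 = expand_by_family(seed, family_of, members_of)
    step2 = expand_by_citations(step1, cites)
    return expand_by_family(step2, family_of, members_of)


# Toy data: publication -> family ID, and publication -> publications it cites.
family_of = {"A1": "F1", "A2": "F1", "B1": "F2", "C1": "F3", "C2": "F3", "D1": "F4"}
members_of: Dict[str, Set[str]] = {}
for pub, family in family_of.items():
    members_of.setdefault(family, set()).add(pub)
cites = {"A2": {"B1"}, "C1": {"A1"}, "D1": {"Z9"}}

universe = build_universe({"A1"}, family_of, members_of, cites)
```

In this toy example the seed A1 pulls in its family member A2, then B1 (cited by A2) and C1 (which cites A1), and finally C2 through C1's family. D1 stays out because it has no family or citation link to the set.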
OPAL Performance
The machine learning algorithms achieved high predictive performance, reflected by an F-1 score of 0.98, which is commensurate with Unified Patents' prior landscapes and indicates a high degree of precision and recall. The F-1 score is the harmonic mean of precision (the number of true positives divided by the total number of positive predictions) and recall (the number of true positives divided by the number of patents that should have been identified as positive).
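The definitions above translate directly into code. In the sketch below the function names and the confusion-matrix counts are hypothetical and chosen only to illustrate the arithmetic; they are not the actual evaluation counts.

```python
from typing import Tuple


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F-1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_from(precision, recall)


# Hypothetical counts: 98 true positives, 2 false positives, 2 false negatives.
precision, recall, f1 = precision_recall_f1(tp=98, fp=2, fn=2)

# Cross-check against the figures reported for EconSight's BERT-based classifier.
econsight_f1 = f1_from(0.8, 0.9)
```

With a precision of 0.8 and a recall of 0.9, the harmonic mean is about 0.847, consistent with the F1 score of 0.85 reported for EconSight's original classifier.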
This robust methodology allows Unified Patents to offer stakeholders an objective, transparent, and reliable view of the rapidly evolving GenAI patent landscape.