Introduction
If you're new to AI
Read these two sections first. They will help you understand the current lay of the land, along with what all this new jargon means so that you're not completely lost.
Getting started with code
This is the first part of the "Choosing ML Models" section.
ai-guide-pick-a-model-test-a-model
How do I pick a model?¶
Unlike other guides, this one is designed to help pick the right model for whatever task you're trying to do, by:
- teaching you how to always remain on the bleeding edge of published AI research
- broadening your perspective on current open options for any given task
- not be tied to a closed-source / closed-data large language model (ex OpenAI, Anthropic)
- creating a data-led system for always identifying and using the state-of-the-art (SOTA) model for any particular task.
We're going to hone in on "text summarization" as our first task.
So... why are we not using one of the popular large language models?¶
Great question. Most available LLMs worth their salt can do many tasks, including summarization, but not all of them may be good at what specifically you want them to do. We should figure out how to evaluate whether they actually can or not.
Also, many of the current popular LLMs are not open, are trained on undisclosed data and exhibit biases. Responsible AI use requires careful choices, and we're here to help you make them.
Finally, most large LLMs require powerful GPU compute to use. While there are many models that you can use as a service, most of them cost money per API call. Unnecessary when some of the more common tasks can be done at good quality with already available open models and off-the-shelf hardware.
Why does using open models matter?¶
Over the last few decades, engineers have been blessed with being able to onboard by starting with open source projects, and eventually shipping open source to production. This default state is now at risk.
Yes, there are many open models available that do a great job. However, most guides don't discuss how to get started with them using simple steps and instead bias towards existing closed APIs.
Funding is flowing to commercial AI projects, who have larger budgets than open source contributors to market their work, which inevitably leads to engineers starting with closed source projects and shipping expensive closed projects to production.
Our First Project - Summarization¶
We're going to:
- Find text to summarize.
- Figure out how to summarize them using the current state-of-the-art open source models.
- Write some code to do so.
- Evaluate quality of results using relevant metrics
For simplicity's sake, let's grab Mozilla's Trustworthy AI Guidelines in string form
Note that in the real world, you will likely have use other libraries to extract content for any particular file type.
In [ ]:
import textwrap
content = """Mozilla's "Trustworthy AI" Thinking Points:
PRIVACY: How is data collected, stored, and shared? Our personal data powers everything from traffic maps to targeted advertising. Trustworthy AI should enable people to decide how their data is used and what decisions are made with it.
FAIRNESS: We’ve seen time and again how bias shows up in computational models, data, and frameworks behind automated decision making. The values and goals of a system should be power aware and seek to minimize harm. Further, AI systems that depend on human workers should protect people from exploitation and overwork.
TRUST: People should have agency and control over their data and algorithmic outputs, especially considering the high stakes for individuals and societies. For instance, when online recommendation systems push people towards extreme, misleading content, potentially misinforming or radicalizing them.
SAFETY: AI systems can carry high risk for exploitation by bad actors. Developers need to implement strong measures to protect our data and personal security. Further, excessive energy consumption and extraction of natural resources for computing and machine learning accelerates the climate crisis.
TRANSPARENCY: Automated decisions can have huge personal impacts, yet the reasons for decisions are often opaque. We need to mandate transparency so that we can fully understand these systems and their potential for harm."""
Great. Now we're ready to start summarizing.
A brief pause for context.¶
The AI space is moving so fast that it requires a tremendous amount of catching up on scientific papers each week to understand the lay of the land and the state of the art.
It's some effort for an engineer who is brand new to AI to:
- discover which open models are even out there
- which models are appropriate for any particular task
- which benchmarks are used to evaluate those models
- which models are performing well based on evaluations
- which models can actually run on available hardware
For the working engineer on a deadline, this is problematic. There's not much centralized discourse on working with open source AI models. Instead there are fragmented X (formerly Twitter) threads, random private groups and lots of word-of-mouth transfer.
However, once we have a workflow to address all of the above, you will have the means to forever be on the bleeding age of published AI research.
How do I get a list of available open summarization models?¶
For now, we recommend Huggingface and their large directory of open models broken down by task. This is a great starting point. Note that larger LLMs are also included in these lists, so we will have to filter.
In this huge list of summarization models, which ones do we choose?
We don't know what any of these models are trained on. For example, a summarizer trained on news articles vs Reddit posts will perform better on news articles.
What we need is a set of metrics and benchmarks that we can use to do apples-to-apples comparisons of these models.
How do I evaluate summarization models?¶
These steps below can be used to evaluate any available model for any task. It requires hopping between a few sources of data for now, but we will be making this a lot easier moving forward.
Steps:
- Find the most common datasets used to train models for summarization.
- Find the most common metrics used to evaluate models for summarization across those datasets.
- Do a quick audit on training data provenance, quality and any exhibited biases, to keep in line with Responsible AI usage.
Finding datasets¶
The easiest way to do this is using Papers With Code, an excellent resource for finding the latest scientific papers by task that also have code repositories attached.
First, filter Papers With Code's "Text Summarization" datasets by most cited text-based English datasets.
Let's pick (as of this writing) the most cited dataset -- the "CNN/DailyMail" dataset. Usually most cited is one marker of popularity.
Now, you don't need to download this dataset. But we're going to review the info Papers With Code have provided to learn more about it for the next step. This dataset is also available on Huggingface.
You want to check the following:
- license
- recent papers
- whether the data is traceable and the methods are transparent
First, check the license. In this case, it's MIT licensed, which means it can be used for both commercial and personal projects.
Next, see if the papers using this dataset are recent. You can do this by sorting Papers in descending order. This particular dataset has many papers from 2023 - great!
Finally, let's check whether the data is from a credible source. In this case, the dataset was generated by IBM in partnership with the University of Montréal. Great.
Now, let's dig into how we can evaluate models that use this dataset.
Evaluating models¶
Next, we look for measured metrics that are common across datasets for the summarization task. BUT, if you're not familiar with the literature on summarization, you have no idea what those are.
To find out, pick a "Subtask" that's close to what you'd like to see. We'd like to summarize the CNN article we pulled down above, so let's choose "Abstractive Text Summarization".
Now we're in business! This page contains a significant amount of new information.
There are mentions of three new terms: ROUGE-1, ROUGE-2 and ROUGE-L. These are the metrics that are used to measure summarization performance).
There are also a list of models and their scores on these three metrics - this is exactly what we're looking for.
Assuming we're looking at ROUGE-1 as our metric, we now have the top 3 models that we can evaluate in more detail. All 3 are close to 50, which is a promising ROUGE score (read up on ROUGE).
Testing out a model¶
OK, we have a few candidates, so let's pick a model that will run on our local machines. Many models get their best performance when running on GPUs, but there are many that also generate summaries fast on CPUs. Let's pick one of those to start - Google's Pegasus.
In [ ]:
# first we install huggingface's transformers library
%pip install transformers sentencepiece
Then we find Pegasus on Huggingface. Note that part of the datasets Pegasus was trained on includes CNN/DailyMail which bodes well for our article summarization. Interestingly, there's a variant of Pegasus from google that's only trained on our dataset of choice, we should use that.
In [ ]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch
# Set the seed, this will help reproduce results. Changing the seed will
# generate new results
from transformers import set_seed
set_seed(248602)
# We're using the version of Pegasus specifically trained for summarization
# using the CNN/DailyMail dataset
model_name = "google/pegasus-cnn_dailymail"
# If you're following along in Colab, switch your runtime to a
# T4 GPU or other CUDA-compliant device for a speedup
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the tokenizer
tokenizer = PegasusTokenizer.from_pretrained(model_name)
# Load the model
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
In [ ]:
# Tokenize the entire content
batch = tokenizer(content, padding="longest", return_tensors="pt").to(device)
# Generate the summary as tokens
summarized = model.generate(**batch)
# Decode the tokens back into text
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
# Compare
def compare(original, summarized_text):
print(f"Article text length: {len(original)}\n")
print(textwrap.fill(summarized_text, 100))
print()
print(f"Summarized length: {len(summarized_text)}")
compare(content, summarized_text)
Article text length: 1427
Trustworthy AI should enable people to decide how their data is used.<n>values and goals of a system
should be power aware and seek to minimize harm.<n>People should have agency and control over their
data and algorithmic outputs.<n>Developers need to implement strong measures to protect our data and
personal security.
Summarized length: 320
Alright, we got something! Kind of short though. Let's see if we can make the summary longer...
In [ ]:
set_seed(860912)
# Generate the summary as tokens, with a max_new_tokens
summarized = model.generate(**batch, max_new_tokens=800)
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
Article text length: 1427
Trustworthy AI should enable people to decide how their data is used.<n>values and goals of a system
should be power aware and seek to minimize harm.<n>People should have agency and control over their
data and algorithmic outputs.<n>Developers need to implement strong measures to protect our data and
personal security.
Summarized length: 320
Well, that didn't really work. Let's try a different approach called 'sampling'. This allows the model to pick the next word according to its conditional probability distribution (specifically, the probability that said word follows the word before).
We'll also be setting the 'temperature'. This variable works to control the levels of randomness and creativity in the generated output.
In [ ]:
set_seed(118511)
summarized = model.generate(**batch, do_sample=True, temperature=0.8, top_k=0)
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
Article text length: 1427
Mozilla's "Trustworthy AI" Thinking Points:.<n>People should have agency and control over their data
and algorithmic outputs.<n>Developers need to implement strong measures to protect our data.
Summarized length: 193
Shorter, but the quality is higher. Adjusting the temperature up will likely help.
In [ ]:
set_seed(108814)
summarized = model.generate(**batch, do_sample=True, temperature=1.0, top_k=0)
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
Article text length: 1427
Mozilla's "Trustworthy AI" Thinking Points:.<n>People should have agency and control over their data
and algorithmic outputs.<n>Developers need to implement strong measures to protect our data and
personal security.<n>We need to mandate transparency so that we can fully understand these systems
and their potential for harm.
Summarized length: 325
Now let's play with one other generation approach called top_k sampling -- instead of considering all possible next words in the vocabulary, the model only considers the top 'k' most probable next words.
This technique helps to focus the model on likely continuations and reduces the chances of generating irrelevant or nonsensical text.
It strikes a balance between creativity and coherence by limiting the pool of next word choices, but not so much that the output becomes deterministic.
In [ ]:
set_seed(226012)
summarized = model.generate(**batch, do_sample=True, top_k=50)
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
Article text length: 1427
Mozilla's "Trustworthy AI" Thinking Points look at ethical issues surrounding automated decision
making.<n>values and goals of a system should be power aware and seek to minimize harm.<n>People
should have agency and control over their data and algorithmic outputs.<n>Developers need to
implement strong measures to protect our data and personal security.
Summarized length: 355
Finally, let's try top_p sampling -- also known as nucleus sampling, is a strategy where the model considers only the smallest set of top words whose cumulative probability exceeds a threshold 'p'.
Unlike top-k which considers a fixed number of words, top-p adapts based on the distribution of probabilities for the next word. This makes it more dynamic and flexible. It helps create diverse and sensible text by allowing less probable words to be selected when the most probable ones don't add up to 'p'.
In [ ]:
set_seed(21420041)
summarized = model.generate(**batch, do_sample=True, top_p=0.9, top_k=50)
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
# saving this for later.
pegasus_summarized_text = summarized_text
Article text length: 1427
Mozilla's "Trustworthy AI" Thinking Points:.<n>People should have agency and control over their data
and algorithmic outputs.<n>Developers need to implement strong measures to protect our data and
personal security.<n>We need to mandate transparency so that we can fully understand these systems
and their potential for harm.
Summarized length: 325
Now, let's try out another model -- Meta's "BART".
Looking at the PapersWithCode graph, BART has solid results with ROUGE-1.
Similar to Pegasus, BART has a custom version finetuned on CNN data.
In [ ]:
from transformers import BartTokenizer, BartForConditionalGeneration
set_seed(120986)
bart_model_name = "facebook/bart-large-cnn"
# Load the tokenizer
bart_tokenizer = BartTokenizer.from_pretrained(bart_model_name)
# Load the model
bart_model = BartForConditionalGeneration.from_pretrained(bart_model_name).to(device)
In [ ]:
# Using the same parameters as Pegasus, let's try running BART
batch = bart_tokenizer(content, padding="longest", return_tensors="pt").to(device)
summarized = bart_model.generate(**batch, do_sample=True, top_p=0.5, top_k=50, max_new_tokens=500)
summarized_decoded = bart_tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
bart_summarized_text = summarized_text
Article text length: 1427
Mozilla's "Trustworthy AI" Thinking Points: How is data collected, stored, and shared? Our personal
data powers everything from traffic maps to targeted advertising. Trustworthy AI should enable
people to decide how their data is used and what decisions are made with it.
Summarized length: 271
Is this the best that BART can do? Unlikely. We can take this as a starting point to experiment.
You should now have enough of a workflow mapped out to find, select and try these models out for not just summarization but any text-based use-case. Let's start learning and experimenting!
There are many other variables that control the output, but this is a great stopping point to switch over to how to evaluate the results of these and any other model quantitatively for your own use-cases.
Why this guide?
Large language and image/video generation models are opening up new ways to create, plan and execute the things we do at work and home every day.
So far, onboarding into the state-of-the-art within AI has been challenging for engineers of any level who are new to the AI space. Current onboarding options require piecing-together various solutions in disjointed ways. Most of these solutions are also not default open source, and there are vast amounts of new terminology, entities and libraries to learn.
As of this moment, the historical pattern of engineers starting and ending projects with primarily open source components is at risk. Foundation models are very expensive to train from scratch, and open source versions have a hard time competing with well resourced for-profit companies. Mozilla is planting a flag in the ground in this space and creating a new hub for open-source AI projects, kickstarting a vibrant open-source community, with clear shared goals and improved open-source visibility for this important new sector.
Why Mozilla?
Mozilla has been developing and shipping AI through its products and initiatives for many years. From Common Voice to Firefox Translate and beyond, we see in AI the potential to empower humans, make technology more accessible, and improve lives.
But like many others, we also see the risk of real-world harm. We don’t have to cast our gazes to some far-flung science fiction future to envision it: people are already experiencing real-world impacts from this technology, right here and now.
Much will depend on how this technology is developed and deployed, but it will also depend on who guides it. We believe that the open source community has a critical role to play in ensuring that AI develops in a way that is responsible, trustworthy, equitable, and safe. We hope that this guide contributes to mobilizing the open source community to join us in this important work.