Quotebank
Quotebank is a corpus of 178 million quotations extracted from 162 million English news articles published between 2008 and 2020. These quotations are well labeled and rich in information, and the data would be more valuable if it were accessible to non-technical people who might be interested in it, such as news reporters. This project provides a tool for that purpose: a search interface that can be used to discover insights in this large dataset.
For things related to the source code, please refer to development.
Frequently asked questions
Why was my search result empty?
See an empty result like this?

There are several possible reasons:
- The textual query you entered does not match any quotation or article in the database. Try to rephrase your query. If you are searching with the article-centric data, you can also try to search with the context of quotations by selecting the option in the dropdown menu.
- Your search filters are too strict. Try to remove some filters and see if you can get any results.
Why does the link to the original source news article not work?

The link to the original source news article might be broken or pointing to a different webpage. The data was collected between 2008 and 2020, and many websites might have changed their domain names or even gone out of business. We are sorry for the inconvenience, but there’s little we can do in this regard (see https://en.wikipedia.org/wiki/Link_rot for additional details).
If you are interested in the source of the quotation, you could try the other listed sources or search for the quotation on the Internet, for example on Google. Alternatively, you could enter the link to the source article into the Internet Archive’s Wayback Machine, which may have an archived copy of the webpage.
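If you have many dead links to check, the Wayback Machine lookup can also be scripted. The following is a minimal sketch that only builds the addresses to visit. It assumes the Internet Archive's public `web.archive.org/web/<url>` address pattern and its `archive.org/wayback/available` lookup endpoint; the function names are our own illustration and are not part of Quotebank.

```python
from urllib.parse import urlencode


def wayback_url(article_url: str) -> str:
    """Address that typically redirects to an archived copy, if one exists."""
    return "https://web.archive.org/web/" + article_url


def wayback_availability_url(article_url: str, timestamp: str = "") -> str:
    """Address of the availability lookup, which answers in JSON.

    `timestamp` is optional and uses the YYYYMMDD form, e.g. "20160101",
    to ask for the capture closest to that day.
    """
    params = {"url": article_url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


# example.com is a placeholder, not a real source article.
print(wayback_url("http://example.com/news/story.html"))
print(wayback_availability_url("http://example.com/news/story.html", "20160101"))
```

Opening the first address in a browser is equivalent to pasting the link into the Wayback Machine search box.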
What are QIDs?
QIDs, or Wikidata IDs, are unique identifiers assigned to every item in Wikidata, a free and collaborative knowledge base of structured data. Each QID consists of the letter ‘Q’ followed by a number. For example, Q23 represents “George Washington” in Wikidata.
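As a small illustration, the snippet below checks whether a string looks like a QID and turns it into the address of the corresponding Wikidata page. It is a sketch under the assumption that a QID is the letter Q followed by a positive integer without leading zeros; the helper names are hypothetical.

```python
import re

# Assumption: "Q" followed by a positive integer without leading zeros.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def is_qid(value: str) -> bool:
    """Return True if the whole string has the shape of a QID."""
    return QID_PATTERN.fullmatch(value) is not None


def wikidata_url(qid: str) -> str:
    """Build the address of the Wikidata page for a QID."""
    if not is_qid(qid):
        raise ValueError(f"not a QID: {qid!r}")
    return f"https://www.wikidata.org/wiki/{qid}"


print(wikidata_url("Q23"))  # the Wikidata page for George Washington
```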
What is an unknown speaker?

In case the speaker of a quotation is unknown, we will display “Unknown speaker” as the speaker. This could happen for a variety of reasons including:
- The speaker may not have been mentioned in the article.
- The name of the speaker was mentioned in the article, but could not be mapped to a corresponding Wikidata entry. Alternatively, this could also happen if the speaker does not have a Wikidata entry.
- As machine learning models are not perfect and ours does not have 100% recall, it may occur that the name of the speaker was mentioned in the article, but our machine learning model was unable to identify it.
Why does the speaker of a quotation have multiple QIDs?
The speaker of a quotation might have multiple QIDs. This happens when the identified speaker name matches multiple Wikidata entities. For instance, the name John Smith matches both the English soldier, explorer, and writer John Smith and the English botanist John Smith. In this case, we will display both QIDs.
How is my query used to find the quotations?
When you enter your query on the website, it is sent to Elasticsearch to find matching quotations or articles. At a high level, Elasticsearch conducts a fuzzy match of your query against the indexed content by breaking down the query text into smaller pieces. Specifically, it employs techniques like full-text search, tokenization, and relevance scoring to locate quotations closely aligned with your query, and it presents the most relevant results first, based on their match score with your search terms. For additional details, please see the Elasticsearch documentation.
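To make this more concrete, here is a rough sketch of what a request body for Elasticsearch could look like. It is not the query the website actually sends: the field names `quotation` and `date` and the function `build_query` are assumptions made for illustration. Only the `match`, `match_phrase`, `bool`, and `range` clauses are standard parts of the Elasticsearch query language.

```python
import json


def build_query(text, exact=False, date_from=None, date_to=None):
    """Build an illustrative Elasticsearch request body as a plain dict.

    `match` tokenizes the text and scores documents by relevance;
    `match_phrase` requires the words to appear together and in order.
    """
    clause = "match_phrase" if exact else "match"
    # "quotation" is a hypothetical field name.
    bool_query = {"must": [{clause: {"quotation": text}}]}

    date_range = {}
    if date_from:
        date_range["gte"] = date_from
    if date_to:
        date_range["lte"] = date_to
    if date_range:
        # "date" is a hypothetical field name as well.
        bool_query["filter"] = [{"range": {"date": date_range}}]

    return {"query": {"bool": bool_query}}


body = build_query("win the war", date_from="2016-01-01", date_to="2016-12-31")
print(json.dumps(body, indent=2))
```

Filters are placed in the `filter` part of the `bool` query because they only narrow the result set and do not need to influence the relevance score.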
For a more detailed explanation of how to use the website, scroll down to How to use ↓.
Features
Two different types of data
Quotebank contains two types of data:
- Quotation-centric data: quotations are aggregated across all their occurrences in the news article data. Each entry in this dataset is a quotation along with its speaker, its date of publication, links to the articles that contain it, etc.
- Article-centric data: each entry in this dataset is a news article that contains one or more quotations.
You can choose which one to search on.
Various search criteria/filters
For the quotation-centric dataset, you can apply filters based on the following attributes:
- Name of the speaker (e.g., Donald Trump)
- Nationality of the speaker (e.g., United States of America)
- Gender of the speaker (e.g., Male, Female)
- Occupation of the speaker (e.g., Politician, Tennis player)
- URL domain of the website where the quotation was published (e.g., nytimes.com)
- Minimum number of a quotation’s occurrences
- Time window of publication (e.g., between 1 Jan 2016 and 31 Dec 2016)
For the article-centric dataset, on the other hand, we support text queries on the context of quotations.
For all text queries, exact match search is also supported.
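The difference between a regular search and an exact match search can be illustrated with a toy example. This is a deliberate simplification and not how Elasticsearch scores results: here a regular match only requires every query word to occur somewhere in the quotation, while an exact match requires the words to occur next to each other and in order. The function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def loose_match(query, quotation):
    """True if every query word occurs somewhere in the quotation."""
    words = set(tokenize(quotation))
    return all(token in words for token in tokenize(query))


def exact_match(query, quotation):
    """True if the query words occur consecutively and in order."""
    wanted = tokenize(query)
    words = tokenize(quotation)
    n = len(wanted)
    return n > 0 and any(words[i:i + n] == wanted for i in range(len(words) - n + 1))


first = "We will win the war and bring everyone home"
second = "The war is cruel. No one will win"

for quotation in (first, second):
    print(loose_match("win the war", quotation), exact_match("win the war", quotation))
```

Both quotations contain the words “win”, “the”, and “war”, so both pass the regular match, but only the first contains them as one contiguous phrase.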
Auto-complete
For your convenience, we support auto-complete for some fields, such as the name and the nationality of the speaker.

Share the result
To share your result, click on the “Share” button. A link will be copied, and you can share it with others. When they open the link, all fields will already be filled in for them, and a simple click on “Submit” will bring up the results.
Download the result
To download the retrieved result, click on the “Download” button. The retrieved entries will be packed as a JSON file.
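Once downloaded, the file can be inspected with a few lines of Python. The sketch below assumes that the file holds a JSON array with one object per retrieved entry; the exact field names are not documented here, so the code simply reports which fields it finds. The sample entries and the speaker name are made up for illustration.

```python
import json
from collections import Counter


def summarize_entries(entries):
    """Return (number of entries, how often each field name occurs)."""
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of entries")
    field_counts = Counter(
        key for entry in entries if isinstance(entry, dict) for key in entry
    )
    return len(entries), dict(field_counts)


def summarize_download(path):
    """Load a downloaded result file and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize_entries(json.load(handle))


# Made-up entries standing in for a real download.
sample = [
    {"quotation": "We will win the war and bring everyone home", "speaker": "Jane Doe"},
    {"quotation": "The war is cruel. No one will win"},
]
print(summarize_entries(sample))
```

For a real file, call `summarize_download` with the path of the downloaded JSON file.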
How to use
When you first enter the app, you should see something like this:

This is the basic search mode.
- The small question button next to the Quotebank logo links to this page.
- Type in what you want to search for and click the “Go” button, and quotations that match your search text will show up in the result page.
- If you have no idea what to search for, you can also click on the small “Or try me” button. You will see the results of a sample search text we picked for you.
- If you want to explore more about this app, click the “Advanced search” button.
Advanced search
This is what the advanced search mode looks like:

- Click on this button if you want to switch back to the basic mode.
- You can choose which type of data you are interested in.
- Quotation-centric: you will search on individual quotations.
- Article-centric: you will search on news articles. Each news article in the search result contains some quotations.
In this picture we show the quotation-centric option first.
- You can type keywords you want to use for the search here.
- Tick if you want exact match. For example, there are two quotations: “We will win the war and bring everyone home” and “The war is cruel. No one will win”. If the search text is “win the war”, both quotations will match if you search without exact match, while only the first one will match if exact match is enabled.
- This is the speaker filter. You can limit results to quotations produced by a certain speaker. We support auto-complete here. You can type in a few characters, and suggested options will show up, as displayed in auto-complete. Note: you must select one option here. If none of the suggested options match what you are looking for, this filter will be ignored.
- This is the nationality filter. You can limit results to quotations produced by speakers with a certain nationality, e.g., United States of America. Auto-complete is supported here as well. Note: you must select one option here. If none of the suggested options match what you are looking for, this filter will be ignored.
- This is the gender filter. You can limit results to quotations produced by speakers with a certain gender, e.g., male. Auto-complete is supported here as well. Note: you must select one option here. If none of the suggested options match what you are looking for, this filter will be ignored.
- This is the occupation filter. You can limit results to quotations produced by speakers with a certain occupation, e.g., Politician. Auto-complete is supported here as well. Note: you must select one option here. If none of the suggested options match what you are looking for, this filter will be ignored.
- This is the URL domain filter. Quotations were extracted from all kinds of websites, and we recorded where they were from. You can limit results to quotations published in a certain web domain, e.g., nytimes.com.
- You can also filter quotations based on how many times they appeared in different places. For instance, if a quotation appears in 3 different news articles, its number of occurrences is 3.
- You can also specify the time window you want to search for. Only quotations published in this time window will be considered.
- After everything is set, click on this “Go” button to see the results.
- If you have no idea what to search for, you can also click on the small “Or try me” button. You will see the results of a sample search text we picked for you.
Article-centric option
If you want to search with the article-centric data, you will see fewer options:

- Click on this button if you want to switch back to the basic mode.
- You can choose which type of data you are interested in.
- Quotation-centric: you will search on individual quotations.
- Article-centric: you will search on news articles. Each news article in the search result contains some quotations.
In this picture we show the article-centric option.
- You can type keywords you want to use for the search here.
- Tick if you want exact match. For example, there are two quotations: “We will win the war and bring everyone home” and “The war is cruel. No one will win”. If the search text is “win the war”, both quotations will match if you search without exact match, while only the first one will match if exact match is enabled.
- You can also search with the context of quotations. Context is the content before and after a quotation. For instance, take the quotation “I am disappointed at him”, its context “The scandal was a huge shock to everyone in the company” and “, said John, the CEO of the company”, and the search text “scandal”. This quotation will not match if you search without context, because the word “scandal” appears only in the context.
- You can also specify the time window you want to search for. Only quotations published in this time window will be considered.
- After everything is set, click on this “Go” button to see the results.
- If you have no idea what to search for, you can also click on the small “Or try me” button. You will see the results of a sample search text we picked for you.
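The effect of searching with context can be sketched with the example from the list above. This toy code uses a plain substring check instead of the real search engine, and the function and parameter names are hypothetical.

```python
def matches(search_text, quotation, context_before="", context_after="",
            with_context=False):
    """Return True if the search text occurs in the searched content.

    Without context, only the quotation itself is searched. With context,
    the text before and after the quotation is searched as well.
    """
    searched = quotation
    if with_context:
        searched = " ".join([context_before, quotation, context_after])
    return search_text.lower() in searched.lower()


quotation = "I am disappointed at him"
before = "The scandal was a huge shock to everyone in the company"
after = ", said John, the CEO of the company"

print(matches("scandal", quotation, before, after))                     # False
print(matches("scandal", quotation, before, after, with_context=True))  # True
```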
Result page
Quotation-centric
If you search with the quotation-centric option, you should see something like this:

- The number of results and the time this query took are displayed here. If there are too many results, it will display “Found more than OOO results. Only the first OOO results are displayed”.
- This “Share” button will give you a link where you can share your query. Using the link will bring you to the search mode page, and all the filters you used will be pre-filled. You only need to click on the “Go” button to see the results.
- If you want to download the results, click on this “Save results” button. The results will be packed in JSON.
- The histogram shows the distribution of matched quotations over the time window you specified in the query. If there are too many results, the number here will be more than the number of quotations displayed.
- Each page displays 10 quotations. You can click on the “Next” and “Prev” buttons to see other results.
- Each quotation is displayed in a box. The text in this box is the quotation itself.
- This is the speaker of the quotation. If you click on the name, you will be brought to the Wikidata page of this speaker. Note that many people might share the same name, so the speaker can be ambiguous, which means the displayed Wikidata page might not be that of the actual speaker.
- This is the date and time when the quotation was published.
- These links are where the quotation was extracted from. Click on them to see the original article (the link might be dead by now).
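A histogram like the one described in the list above can be approximated from a list of publication dates. The sketch below groups dates by calendar month. The bin size used on the website is not stated here, so the monthly grouping is an assumption, as is the date format (a string starting with YYYY-MM). The example dates are made up.

```python
from collections import Counter


def monthly_histogram(dates):
    """Count entries per calendar month.

    Assumes each date is a string that starts with YYYY-MM, e.g. "2016-03-14".
    """
    counts = Counter(date[:7] for date in dates)
    return dict(sorted(counts.items()))


example_dates = ["2016-01-03", "2016-01-20", "2016-03-14"]
print(monthly_histogram(example_dates))  # {'2016-01': 2, '2016-03': 1}
```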
Article-centric
If you search with the article-centric option, you should see something like this:

- The number of results and the time this query took are displayed here. If there are too many results, it will display “Found more than OOO results. Only the first OOO results are displayed”.
- This “Share” button will give you a link where you can share your query. Using the link will bring you to the search mode page, and all the filters you used will be pre-filled. You only need to click on the “Go” button to see the results.
- If you want to download the results, click on this “Save results” button. The results will be packed in JSON.
- The histogram shows the distribution of matched articles over the time window you specified in the query. If there are too many results, the number here will be more than the number of articles displayed.
- Each page displays 10 articles. You can click on the “Next” and “Prev” buttons to see other results.
- Each article is displayed in a box. The text in this box is the title of the article.
- This is the date and time when the article was published.
- These are the quotations inside this article. The quotations themselves and their speakers are listed.
- If the list is too long, it is collapsed by default. Click on the “Show more” button to see the full list.
Mobile support
Recently, Quotebank has been optimized to support mobile devices.
