quotebank-es

Quotebank

Quotebank is a corpus of 178 million quotations extracted from 162 million English news articles published between 2008 and 2020. These quotations are well labeled and rich in information, and it would be beneficial if the data were more accessible to non-technical people who might be interested, such as news reporters. This project provides a tool for that purpose, one that can be used to discover insights in this large dataset.

For things related to the source code, please refer to development.

Frequently asked questions

Why was my search result empty?


See an empty result like this?

There are several possible reasons:

Why does the link to the original source news article not work?


The link to the original source news article might be broken or pointing to a different webpage. The data was collected between 2008 and 2020, and many websites might have changed their domain names or even gone out of business. We are sorry for the inconvenience, but there’s little we can do in this regard (see https://en.wikipedia.org/wiki/Link_rot for additional details).

If you are interested in the source of the quotation, you could try the other listed sources or search for the quotation on the Internet, for example on Google. Alternatively, you could enter the link to the source article in the Internet Archive’s Wayback Machine, which may have an archived copy of the webpage.
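As a small illustration of the Wayback Machine suggestion above, the sketch below builds an address that asks the Wayback Machine for an archived copy of a page. It assumes the commonly used `https://web.archive.org/web/<original URL>` address form; the article URL in the example is made up, and an archived copy is not guaranteed to exist.

```python
from urllib.parse import quote

# Address prefix of the Internet Archive's Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(original_url: str) -> str:
    """Return a Wayback Machine address for the given article URL."""
    # Keep URL-structural characters readable and escape the rest (e.g. spaces).
    return WAYBACK_PREFIX + quote(original_url, safe=":/?&=%#")


# Hypothetical article link, used only for illustration.
print(wayback_url("http://example.com/news/2012/story.html"))
```

Opening the printed address in a browser should lead to an archived copy if the Internet Archive has one.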

What are QIDs?

QIDs, or Wikidata IDs, are unique identifiers assigned to every item in Wikidata, a free and collaborative knowledge base of structured data. Each QID starts with the letter ‘Q’ followed by a series of digits. For example, Q23 represents “George Washington” in Wikidata.
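For readers who process QIDs programmatically, here is a minimal sketch of how the format described above could be checked and turned into a Wikidata page address. The helper names are illustrative and not part of this project, and the pattern assumes the number has no leading zeros.

```python
import re

# "Q" followed by digits; assumes no leading zeros (an assumption of this sketch).
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def is_qid(text: str) -> bool:
    """Return True if text looks like a Wikidata QID such as 'Q23'."""
    return QID_PATTERN.fullmatch(text) is not None


def wikidata_url(qid: str) -> str:
    """Return the address of the Wikidata page for a QID."""
    if not is_qid(qid):
        raise ValueError(f"not a valid QID: {qid!r}")
    return f"https://www.wikidata.org/wiki/{qid}"


print(wikidata_url("Q23"))  # the page for George Washington
```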

What is an unknown speaker?


In case the speaker of a quotation is unknown, we will display Unknown speaker as the speaker. This could happen for a variety of reasons including:

Why does the speaker of a quotation have multiple QIDs?


The speaker of a quotation might have multiple QIDs. This happens when the identified speaker name matches multiple Wikidata entities. For instance, the name John Smith matches both the English soldier, explorer, and writer John Smith and the English botanist John Smith. In this case, we will display both QIDs.

How is my query used to find the quotations?


When you enter your query on the website, it is sent to Elasticsearch to find matching quotations or articles. At a high level, Elasticsearch conducts a fuzzy match of your query against indexed content by breaking down the query text into smaller pieces. Specifically, it employs techniques like full-text search, tokenization, and relevance scoring to locate quotations closely aligned with your query, and presents the most relevant results based on their match score with your search terms. For additional details, please see the Elasticsearch documentation.
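To make this more concrete, the sketch below builds two request bodies in the Elasticsearch Query DSL: a `match` query with fuzziness for ordinary searches and a `match_phrase` query for exact matches. This is only an illustration of the kind of query involved. The field name `quotation` is an assumption, and the queries the website actually sends may differ.

```python
import json


def build_query(text: str, exact: bool = False, field: str = "quotation") -> dict:
    """Build an Elasticsearch request body for a text search.

    `field` is a hypothetical index field name used for illustration only.
    """
    if exact:
        # match_phrase requires the terms to appear together and in order.
        return {"query": {"match_phrase": {field: text}}}
    # match tokenizes the text; fuzziness tolerates small spelling differences.
    return {"query": {"match": {field: {"query": text, "fuzziness": "AUTO"}}}}


print(json.dumps(build_query("win the war"), indent=2))
print(json.dumps(build_query("win the war", exact=True), indent=2))
```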


For a more detailed explanation on how to use the website scroll down to How to use ↓.

Features

Two different types of data

Quotebank contains two types of data:

You can choose which one to search on.

Various search criteria/filters

For the quotation-centric dataset, you can apply filters based on the following attributes:

For the article-centric dataset, on the other hand, we support text queries on the context of quotations.

For all text queries, exact match search is also supported.

Auto-complete

For your convenience, we support auto-complete for some fields, such as the name and nationality of the speaker.

Share the result

To share your result, click on the “Share” button. A link will be copied, and you can share it with others. When they open the link, they will reach our website with all fields already filled in. A simple click on “Submit” will then bring up the results.
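One common way to implement this kind of shareable link is to store the search fields in the query string of the URL. The sketch below shows that general technique only. The base address and parameter names are hypothetical and do not describe the real link format used by the website.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical base address, for illustration only.
BASE_URL = "https://example.org/quotebank/search"


def make_share_link(filters: dict) -> str:
    """Encode the non-empty search fields into a link."""
    active = {key: value for key, value in filters.items() if value not in ("", None)}
    return BASE_URL + "?" + urlencode(active)


def read_share_link(link: str) -> dict:
    """Recover the search fields from a link, ready to pre-fill a form."""
    query = urlsplit(link).query
    return {key: values[0] for key, values in parse_qs(query).items()}


# Hypothetical field names.
link = make_share_link({"text": "win the war", "speaker": "", "domain": "nytimes.com"})
print(link)
print(read_share_link(link))
```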

Download the result

To download the retrieved result, click on the “Download” button. The retrieved entries will be packed as a JSON file.
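Once downloaded, the JSON file can be inspected with a few lines of Python. The sketch below makes no assumption about the field names in the file. It only counts the entries and lists whichever fields it finds, and the file name in the usage comment is a placeholder.

```python
import json
from pathlib import Path


def summarize(data) -> dict:
    """Count the entries and collect the field names found in parsed JSON data."""
    # Treat a single top-level object as one entry.
    entries = data if isinstance(data, list) else [data]
    fields = set()
    for entry in entries:
        if isinstance(entry, dict):
            fields.update(entry.keys())
    return {"entries": len(entries), "fields": sorted(fields)}


def summarize_file(path: str) -> dict:
    """Load a downloaded JSON file and summarize it."""
    return summarize(json.loads(Path(path).read_text(encoding="utf-8")))


# Usage, with a placeholder file name:
# print(summarize_file("results.json"))
```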

How to use

When you first enter the app, you should see something like this:

This is the basic search mode.

  1. The small question button next to the Quotebank logo links to this page.
  2. Type in what you want to search for and click the “Go” button, and quotations that match your search text will show up in the result page.
  3. If you have no idea what to search for, you can also click on the small “Or try me” button. You will see the results of a sample search text we picked for you.
  4. If you want to explore more about this app, click the “Advanced search” button.

This is what the advanced search mode looks like:

  1. Click on this button if you want to switch back to the basic mode.
  2. You can choose which type of data you are interested in.
    • Quotation-centric: you will search on individual quotations.
    • Article-centric: you will search on news articles. Each news article in the search result contains some quotations. In this picture we show the quotation-centric option first.
  3. You can type keywords you want to use for the search here.
  4. Tick if you want exact match. For example, there are two quotations: We will win the war and bring everyone home and The war is cruel. No one will win. If the search text is win the war, both quotations will match if you search without exact match, while only the first one will match if exact match is enabled.
  5. This is the speaker filter. You can limit results to quotations produced by a certain speaker. We support auto-complete here. You can type in a few characters, and suggested options will show up, as displayed in auto-complete. Note: you must select one option here. If none of the suggested options match what you are looking for, this filter will be ignored.
  6. This is the nationality filter. You can limit results to quotations produced by speakers with a certain nationality, e.g., United States of America. Auto-complete is supported here as well. Note: you must select one option here. If none of the suggested options match what you are looking for, this filter will be ignored.
  7. This is the gender filter. You can limit results to quotations produced by speakers with a certain gender, e.g., male. Auto-complete is supported here as well. Note: you must select one option here. If none of the suggested options match what you are looking for, this filter will be ignored.
  8. This is the occupation filter. You can limit results to quotations produced by speakers with a certain occupation, e.g., Politician. Auto-complete is supported here as well. Note: you must select one option here. If none of the suggested options match what you are looking for, this filter will be ignored.
  9. This is the URL domain filter. Quotations were extracted from all kinds of websites, and we recorded where they were from. You can limit results to quotations published in a certain web domain, e.g., nytimes.com.
  10. You can also filter quotations based on how many times they appeared in different places. For instance, if a quotation appears in 3 different news articles, its number of occurrences is 3.
  11. You can also specify the time window you want to search for. Only quotations published in this time window will be considered.
  12. After everything is set, click on this “Go” button to see the results.
  13. If you have no idea what to search for, you can also click on the small “Or try me” button. You will see the results of a sample search text we picked for you.
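The difference between the two modes of the exact match option (item 4 above) can be illustrated with a toy example. This is a simplified stand-in written for this guide, not how Elasticsearch is implemented: without exact match, the words may appear anywhere in the quotation; with exact match, they must appear together and in order.

```python
import re


def tokenize(text: str) -> list:
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def loose_match(query: str, quotation: str) -> bool:
    """True if every word of the query appears somewhere in the quotation."""
    words = set(tokenize(quotation))
    return all(token in words for token in tokenize(query))


def exact_match(query: str, quotation: str) -> bool:
    """True if the query words appear next to each other, in the same order."""
    q, w = tokenize(query), tokenize(quotation)
    n = len(q)
    return n > 0 and any(w[i:i + n] == q for i in range(len(w) - n + 1))


first = "We will win the war and bring everyone home"
second = "The war is cruel. No one will win."

for quotation in (first, second):
    print(loose_match("win the war", quotation), exact_match("win the war", quotation))
```

With the search text win the war, both quotations pass the loose check, while only the first passes the exact check.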

Article-centric option

If you want to search with the article-centric data, you will see fewer options:

  1. Click on this button if you want to switch back to the basic mode.
  2. You can choose which type of data you are interested in.
    • Quotation-centric: you will search on individual quotations.
    • Article-centric: you will search on news articles. Each news article in the search result contains some quotations. In this picture we show the quotation-centric option first.
  3. You can type keywords you want to use for the search here.
  4. Tick if you want exact match. For example, there are two quotations: We will win the war and bring everyone home and The war is cruel. No one will win. If the search text is win the war, both quotations will match if you search without exact match, while only the first one will match if exact match is enabled.
  5. You can also search with the context of quotations. Context is the content before and after a quotation. For instance, take the quotation I am disappointed at him, its context The scandal was a huge shock to everyone in the company, said John, the CEO of the company, and the search text scandal. This quotation will match only if you search with context; it will not match if you search without context.
  6. You can also specify the time window you want to search for. Only quotations published in this time window will be considered.
  7. After everything is set, click on this “Go” button to see the results.
  8. If you have no idea what to search for, you can also click on the small “Or try me” button. You will see the results of a sample search text we picked for you.
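The effect of the context option (item 5 above) can be shown with a toy example. This sketch uses a plain substring check for simplicity, which is an assumption of the illustration and not the matching method used by the website.

```python
def matches(search_text: str, quotation: str, context: str = "", use_context: bool = False) -> bool:
    """Check whether the search text occurs in the quotation, optionally including its context."""
    searched = quotation
    if use_context:
        # Also look at the text surrounding the quotation.
        searched = quotation + " " + context
    return search_text.lower() in searched.lower()


quotation = "I am disappointed at him"
context = "The scandal was a huge shock to everyone in the company, said John, the CEO of the company"

print(matches("scandal", quotation, context))                    # search without context
print(matches("scandal", quotation, context, use_context=True))  # search with context
```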

Result page

Quotation-centric

If you search with the quotation-centric option, you should see something like this:

  1. The number of results and the time this query took are displayed here. If there are too many results, it will display Found more than OOO results. Only the first OOO results are displayed.
  2. This “Share” button will give you a link where you can share your query. Using the link will bring you to the search mode page, and all the filters you used will be pre-filled. You only need to click on the “Go” button to see the results.
  3. If you want to download the results, click on this “Save results” button. The results will be packed in JSON.
  4. The histogram shows the distribution of matched quotations over the time window you specified in the query. If there are too many results, the number here will be more than the number of quotations displayed.
  5. Each page displays 10 quotations. You can click on the “Next” and “Prev” buttons to see other results.
  6. Each quotation is displayed in a box. The text in this box is the quotation itself.
  7. This is the speaker of the quotation. If you click on the name, you will be brought to the Wikidata page of this speaker. Note that since many people might share the same name, the speaker can be ambiguous, which means the displayed Wikidata page might not be that of the actual speaker.
  8. This is the date and time when the quotation was published.
  9. These links are where the quotation was extracted from. Click on them to see the original article (the link might be dead).
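The paging described in item 5 (10 entries per page, with “Next” and “Prev” to move between pages) follows a simple rule that can be sketched as below. The function names are illustrative, and the sketch is not taken from the website's code.

```python
import math

PAGE_SIZE = 10  # entries shown per page, as described above


def page_count(total_results: int) -> int:
    """Number of pages needed to show all results."""
    return math.ceil(total_results / PAGE_SIZE)


def get_page(results: list, page: int) -> list:
    """Return the entries shown on the given page (pages start at 1)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    start = (page - 1) * PAGE_SIZE
    return results[start:start + PAGE_SIZE]


results = list(range(25))    # stand-in for 25 retrieved quotations
print(page_count(len(results)))
print(get_page(results, 3))  # the last, partially filled page
```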

Article-centric

If you search with the article-centric option, you should see something like this:

  1. The number of results and the time this query took are displayed here. If there are too many results, it will display Found more than OOO results. Only the first OOO results are displayed.
  2. This “Share” button will give you a link where you can share your query. Using the link will bring you to the search mode page, and all the filters you used will be pre-filled. You only need to click on the “Go” button to see the results.
  3. If you want to download the results, click on this “Save results” button. The results will be packed in JSON.
  4. The histogram shows the distribution of matched articles over the time window you specified in the query. If there are too many results, the number here will be more than the number of articles displayed.
  5. Each page displays 10 articles. You can click on the “Next” and “Prev” buttons to see other results.
  6. Each article is displayed in a box. The text in this box is the title of the article.
  7. This is the date and time when the article was published.
  8. These are the quotations inside this article. The quotations themselves and their speakers are listed.
  9. If the list is too long, it is collapsed by default. Click on the “Show more” button to see the full list.

Mobile support

Recently, Quotebank has been optimized to support mobile devices.