Yahoo! Answers Manner Questions, version 1.0

Dataset: Yahoo! Answers Manner Questions, version 1.0
Abbreviations: L4
Variations: Webscope L4
Category: Yahoo! Answers Manner Questions, version 1.0
Topics: Question answering

Yahoo! Answers Manner Questions, version 1.0 (L4) is a question answering data set from Yahoo! Answers. It has been used in several papers; the canonical reference is "Learning to rank answers on large online QA collections".

Academic researchers can request the data from the Yahoo! website at:

http://webscope.sandbox.yahoo.com

All questions are 'manner questions' beginning with the word "how".

Content

The data comes in a single XML file called 'manner.xml', which uses its own format (outlined below). Unpacked, the file is 371 MB. It contains 142627 'vespaadd' elements, i.e., questions.

<?xml version='1.0' encoding='UTF-8'?>
<ystfeed>
  <vespaadd>
    <document type="wisdom">
      <uri> here is an integer question identifier.  </uri>
      <subject> the short question </subject>
      <content> optional few sentences description with context for the question </content>
      <bestanswer> the answer selected as the best. </bestanswer>
      <nbestanswers>
        <answer_item> one answer goes here. </answer_item>
        <answer_item> another answer.</answer_item>
        ...
      </nbestanswers>
      <cat> a category </cat>
      <maincat> another category </maincat>
      <subcat> another category, may be the same as 'cat' tag </subcat>
    </document>
  </vespaadd>
  <vespaadd>
    Here goes a new question-answer pair
    ...
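
Because the unpacked file is several hundred megabytes, it can be convenient to stream it one 'document' element at a time instead of building the whole tree in memory. The following is a minimal sketch under the format outlined above, using the standard library's xml.etree.ElementTree.iterparse. The function name iter_documents, the dictionary keys and the SAMPLE snippet are illustrative assumptions and not part of the data set or its README.

```python
import io
import xml.etree.ElementTree as ET


def iter_documents(source):
    """Yield one dict per <document> element.

    `source` may be a filename such as 'manner.xml' or a binary file object.
    """
    for _event, element in ET.iterparse(source, events=('end',)):
        if element.tag != 'document':
            continue
        yield {
            'uri': (element.findtext('uri') or '').strip(),
            'subject': (element.findtext('subject') or '').strip(),
            'content': (element.findtext('content') or '').strip(),
            'bestanswer': (element.findtext('bestanswer') or '').strip(),
            'answers': [(item.text or '').strip()
                        for item in element.iter('answer_item')],
            'maincat': (element.findtext('maincat') or '').strip(),
        }
        # Discard the children and text of the element we just processed.
        element.clear()


# Tiny made-up sample in the format outlined above (not real data).
SAMPLE = b"""<?xml version='1.0' encoding='UTF-8'?>
<ystfeed>
<vespaadd><document type="wisdom">
<uri>1</uri>
<subject>How do I boil an egg?</subject>
<bestanswer>Put it in boiling water.</bestanswer>
<nbestanswers>
<answer_item>Put it in boiling water.</answer_item>
<answer_item>Use an egg cooker.</answer_item>
</nbestanswers>
<maincat>Food &amp; Drink</maincat>
</document></vespaadd>
</ystfeed>
"""

documents = list(iter_documents(io.BytesIO(SAMPLE)))
print(documents[0]['subject'], len(documents[0]['answers']))
```

With the real data, iter_documents('manner.xml') would be used in place of the in-memory sample. Optional elements that are missing come back as empty strings.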

The document tag has a type attribute. This is always 'wisdom':

$ grep "<document" manner.xml  | uniq
<vespaadd><document type="wisdom">

There are 26 categories for the 'maincat' tag:

$ grep "<maincat>" manner.xml  | sort | uniq | wc
     26      63     999

while there are 236 distinct 'subcat' values and 509 distinct 'cat' values. According to the README file, the category tags are all optional.
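
The same kind of count can be sketched in Python by collecting the distinct text values of each category tag. This is an illustration only: the function name distinct_category_values and the SAMPLE snippet are assumptions, and the numbers it gives on the real file could differ slightly from the grep-based counts if values differ only in surrounding whitespace.

```python
import io
import xml.etree.ElementTree as ET

CATEGORY_TAGS = ('maincat', 'subcat', 'cat')


def distinct_category_values(source):
    """Return a dict mapping each category tag to its set of distinct values."""
    values = {tag: set() for tag in CATEGORY_TAGS}
    for _event, element in ET.iterparse(source, events=('end',)):
        if element.tag in values:
            text = (element.text or '').strip()
            if text:
                values[element.tag].add(text)
        elif element.tag == 'document':
            # The category children were already recorded; free the subtree.
            element.clear()
    return values


# Made-up sample; the second document has no 'cat' tag (the tags are optional).
SAMPLE = b"""<ystfeed>
<vespaadd><document type="wisdom">
<uri>1</uri><subject>How do I train a puppy?</subject>
<cat>Dogs</cat><maincat>Pets</maincat><subcat>Dogs</subcat>
</document></vespaadd>
<vespaadd><document type="wisdom">
<uri>2</uri><subject>How can I calm a cat?</subject>
<maincat>Pets</maincat><subcat>Cats</subcat>
</document></vespaadd>
</ystfeed>
"""

values = distinct_category_values(io.BytesIO(SAMPLE))
print({tag: len(found) for tag, found in values.items()})
```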

As the questions are all 'manner questions', they begin with one of a limited set of two-word phrases:

$ perl -ne 'print lc "$1 $2\n" if /<subject>(.+?) +(.*?) /' manner.xml | sort | uniq -c 
  37557 how can
    803 how could
   2578 how did
  71141 how do
   5027 how does
   1211 how should
  21890 how to
   2420 how would

The letter case in the questions varies, e.g., some are written in all caps.
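
Since the letter case varies, the questions need to be lowercased before their beginnings are compared. A rough Python counterpart of the Perl one-liner could look like the sketch below. The helper name question_prefix and the example subjects are made up for illustration. Unlike the Perl regular expression, which requires a space after the second word, this version also counts subjects that consist of only two words.

```python
from collections import Counter


def question_prefix(subject, n_words=2):
    """Return the first `n_words` words of a question, lowercased."""
    return ' '.join(subject.lower().split()[:n_words])


# Made-up example subjects, not taken from the data set.
subjects = [
    'How do I boil an egg?',
    'HOW DO YOU TIE A TIE?',
    'how to  whistle loudly',
]

prefix_counts = Counter(question_prefix(subject) for subject in subjects)
for prefix, count in prefix_counts.most_common():
    print('{:7} {}'.format(count, prefix))
```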

Categories

The main categories are:

$ grep "<maincat>" manner.xml | sort | uniq | sed 's/\&amp;/\&/' | perl -ne 'print "\"$1\", " if /<maincat>(.*?)<\/maincat>/' | fmt -s
"Arts & Humanities", "Asia Pacific", "Beauty & Style", "Business &
Finance", "Cars & Transportation", "Consumer Electronics", "Dining
Out", "Education & Reference", "Entertainment & Music", "Environment",
"Family & Relationships", "Food & Drink", "Games & Recreation", "Health",
"Home & Garden", "Local Businesses", "News & Events", "Pets", "Politics &
Government", "Pregnancy & Parenting", "Science & Mathematics", "Social
Science", "Società e culture", "Society & Culture", "Sports", "Travel",

There are 6296 pairs in the "Entertainment & Music" main category.

$ grep "<maincat>Entertainment &amp; Music</maincat>" manner.xml | wc
   6296   18888  289616
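
A per-category tally for all main categories at once can be sketched with collections.Counter. The function name count_main_categories and the SAMPLE snippet are illustrative assumptions. Documents without a 'maincat' tag are counted under the empty string.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_main_categories(source):
    """Count <document> elements per main category."""
    counts = Counter()
    for _event, element in ET.iterparse(source, events=('end',)):
        if element.tag == 'document':
            counts[(element.findtext('maincat') or '').strip()] += 1
            element.clear()  # free the subtree we have already counted
    return counts


# Made-up sample with three documents in two main categories.
SAMPLE = b"""<ystfeed>
<vespaadd><document type="wisdom">
<uri>1</uri><maincat>Entertainment &amp; Music</maincat>
</document></vespaadd>
<vespaadd><document type="wisdom">
<uri>2</uri><maincat>Pets</maincat>
</document></vespaadd>
<vespaadd><document type="wisdom">
<uri>3</uri><maincat>Entertainment &amp; Music</maincat>
</document></vespaadd>
</ystfeed>
"""

counts = count_main_categories(io.BytesIO(SAMPLE))
for category, count in counts.most_common():
    print('{:7} {}'.format(count, category))
```

Parsing the XML also decodes the '&amp;' entity, so the keys read "Entertainment & Music" rather than the escaped form that grep has to match.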

Python

The following example Python programs read the data set and print or write parts of it.

Show the first 20 question-answer pairs

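import lxml.etree
from itertools import islice

# Pass the filename so that lxml opens the file and handles the encoding itself.
# Note that this loads the whole file into memory.
tree = lxml.etree.parse('manner.xml')
# Number the documents from 1 and stop after the first 20.
for n, element in islice(enumerate(tree.xpath('//document'), start=1), 20):
    # findtext returns the default if an optional element is missing.
    question = element.findtext('subject', default='')
    answer = element.findtext('bestanswer', default='')
    uri = element.findtext('uri', default='')
    print(('{:5} {:6} {}\n    {}\n' + '-' * 80).format(n, uri, question, answer))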

Write first 2000 answers to HTML files

import codecs
import os
from itertools import islice

import lxml.etree

# Create the output directory if it does not exist yet.
directory = 'html'
if not os.path.exists(directory):
    os.makedirs(directory)

template = """<html>
  <head>
    <title>{}</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>{}</body>
</html>"""
# Pass the filename so that lxml opens the file and handles the encoding itself.
tree = lxml.etree.parse('manner.xml')
# Write one HTML file per question, named after its identifier, for the first 2000.
for element in islice(tree.xpath('//document'), 2000):
    answer = element.findtext('bestanswer', default='')
    uri = element.findtext('uri').strip()
    filename = os.path.join(directory, uri + '.html')
    with codecs.open(filename, mode='w', encoding='utf-8') as f:
        f.write(template.format(uri, answer))

Papers

  1. Learning to rank answers on large online QA collections