Brede non-negative matrix factorization


#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
#    Web service for non-negative matrix factorization of a list of strings for topic mining
#
#    License: Affero General Public License
#
# $Id: brede_str_nmf,v 1.23 2012/08/16 18:51:52 fn Exp $

__version__ = '$Revision: 1.23 $'
__author__  = '$Author: fn $'
__all__     = [ 'brede_str_nmf' ]

import time
time_start = time.time()

filebase = "/var/local/www/"

from cgi    import FieldStorage, escape
import math
import numpy as np
import os
from re     import compile, findall, split, sub, DOTALL, UNICODE
from string import lower
from scipy  import sparse
from urllib import urlopen, urlencode
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import scipy
scipy_version = map(int, scipy.__version__.split('.'))

warnings = []

pattern_word = compile(r"""\w+""", UNICODE)

# Load sentiment word list
filenameAFINN = filebase + "/AFINN-111.txt"
try: 
    afinn = dict(map(lambda (w, s): (unicode(w, 'utf-8'), int(s)), [ 
          ws.strip().split('\t') for ws in open(filenameAFINN) ]))
except:
    warnings.append('Could not read sentiment word list')
    afinn = {}
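
# Each line of AFINN-111 is expected to hold a tab-separated word and an
# integer score between -5 (most negative) and +5 (most positive).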


# Map sentiment score to a background color
colors = {
    -5: "#FF2222",
    -4: "#FF4444",
    -3: "#FF6666",
    -2: "#FF8888",
    -1: "#FFCCCC",
     0: "#FFFFFF",
     1: "#CCFFCC",
     2: "#88FF88",
     3: "#66FF66",
     4: "#44FF44",
     5: "#22FF22"}

def stopwords():
    """ 
    Return the stopwords as a list.
    If the stopword file is not available then return an empty list.
    """
    filename = filebase + "/stop_english1.txt"
    try: 
        words = [ unicode(line.strip(), 'utf-8') for line in open(filename).readlines() ]
    except:
        warnings.append('Could not read stopword list')
        words = []
    return words
  

def example_texts():
    """
    Example texts
    """
    return u"""Denmark, officially the Kingdom of Denmark together with Greenland and the Faroe Islands, is a Scandinavian country in Northern Europe.
Sweden, officially the Kingdom of Sweden, is a Nordic country on the Scandinavian Peninsula in Northern Europe. Sweden has land borders with Norway to the west and Finland to the northeast, and water borders with Denmark, Germany, and Poland to the south, and Estonia, Latvia, Lithuania, and Russia to the east. Sweden is also connected to Denmark by a bridge-tunnel across the Öresund.
Germany, officially the Federal Republic of Germany, is a country in Western Europe. It is bordered to the north by the North Sea, Denmark, and the Baltic Sea; to the east by Poland and the Czech Republic; to the south by Austria and Switzerland; and to the west by France, Luxembourg, Belgium, and the Netherlands.
Australia, officially the Commonwealth of Australia, is a country in the Southern Hemisphere comprising the mainland of the Australian continent, the island of Tasmania and numerous smaller islands in the Indian and Pacific Oceans.
The Republic of Botswana is a sub Saharan country located in Southern Africa. The citizens are referred to as "Batswana" . Formerly the British protectorate of Bechuanaland, Botswana adopted its new name after becoming independent within the Commonwealth on 30 September 1966. It has held free and fair democratic elections since independence.
Oman, officially the Sultanate of Oman, is an Arab country in southwest Asia on the southeast coast of the Arabian Peninsula.
Lego (trademarked in capitals as LEGO) is a line of construction toys manufactured by the Lego Group, a privately held company based in Billund, Denmark.
Nestlé S.A. is the largest consumer packaged goods company in the world, founded and headquartered in Vevey, Switzerland.
Tetra Pak is a multinational food processing and packaging company of Swedish origin. 
Arla Foods is a Swedish-Danish cooperative based in Århus, Denmark, and the largest producer of dairy products in Scandinavia.
H. Lundbeck A/S is a Danish international pharmaceutical company engaged in the research and development, production, marketing, and sale of drugs for the treatment of psychiatric and neurological disorders, particularly fast-working, non-addictive anti-depressants for people who need to function while medicated.
Novo Nordisk manufactures and markets pharmaceutical products and services.
The Carlsberg Group is a Danish brewing company founded in 1847 by J. C. Jacobsen after the name of his son Carl. The headquarters are in Copenhagen, Denmark. """


def str2html(s):
    """
    Expects a Unicode string and converts it to UTF-8 and escapes it.
    """
    return escape(s.encode('utf-8'))


def parseform(inform):
    outform = {'data': '',
               'components': None,
               'docsep': 'newline',
               'format': 'default',
               }
    try:
        outform['data'] = unicode(inform.getvalue("data", default=""), 'utf-8')
    except:
        warnings.append('Could not convert input text to Unicode')
    u_format = inform.getvalue("format", default='default')
    if u_format in ['default', 'script', 'scriptfm']:
        outform['format'] = u_format

    u_components = inform.getvalue("components", default=None)
    if u_components and u_components.isdigit():
        components = int(u_components.strip())
        if components < 1:
            components = 1
        outform['components'] = components

    u_docsep = inform.getvalue("docsep", default='newline')
    if u_docsep in ['newline', 'twonewlines']:
        outform['docsep'] = u_docsep

    return outform


def header():
    """
    Return the HTTP header and the HTML page header as a string.
    """
    return """Content-type: text/html; charset=utf-8

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <title>Brede non-negative matrix factorization &mdash; DTU Informatics &mdash; Technical University of Denmark</title>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta http-equiv="Content-language" content="en">
        
    <meta name="description" content="Non-negative matrix factorization">
    <meta name="rating"      content="General">
        
    <link rel="author" href="http://www.imm.dtu.dk/~fn/">
    <link rel="home" href="http://neuro.imm.dtu.dk">

    <link rel="stylesheet" href="/css/brede.css" type="text/css">

    <script type="text/javascript">
      function example_texts() {
	var text = "%s";
        return text;
      }
    </script>
  </head>

  <body>
    <table class="noborder">
      <tr>
        <td width="135">
          <img width="135" height="135" src="/icon/bredewiki.png">
        <td width="100%%">
          <h1>Brede non-negative matrix factorization</h1>
          <hr>
          <a href="http://www.imm.dtu.dk/English.aspx">DTU Informatics</a> &gt;&gt;&gt;
          <a href="http://neuro.imm.dtu.dk/home.html">Neuroinformatics</a> &gt;
          <a href="http://neuro.imm.dtu.dk/services/services.html">Services</a> &gt; <a href="?">Brede non-negative matrix factorization from string</a>
          <hr>
    </table> 
    """ % (sub(r'\n', r'\\n', sub(r'"', r'\\"', str2html(example_texts()))), )


def footer():
    return """
    <hr>
    <a href="http://www.imm.dtu.dk/~fn/">Finn &Aring;rup Nielsen</a>,
    <a href="http://www.imm.dtu.dk/English.aspx">DTU Informatics</a>
    &mdash; %.3f seconds
  </body>
</html>""" % (time.time() - time_start)


def inputform(data="", components="", docsep="newline"):
    if not components:
        components = ""
    if docsep == 'newline':
        docsep1, docsep2,  = 'checked', ''
    else:
        docsep1, docsep2,  = '', 'checked'
    return """
    <form name="inputform" action="/cgi-bin/brede_str_nmf" method="POST">
      <textarea style="width:100%%;" name="data" rows="20" type="text">%s</textarea>
      <br>
      <input type="submit" value="Analyze">
      Number&nbsp;of&nbsp;topics:&nbsp;<input type="text" name="components" value="%s" size="5">
      Document&nbsp;separator:&nbsp;<input type="radio" name="docsep" value="newline" %s>Newline&nbsp;<input type="radio" name="docsep" value="twonewlines" %s>Two newlines
      <br>
      <input type="button" onClick="data.value = example_texts(); components.value = 2;" value="Example texts">
      <input type="button" onClick="data.value = '';" value="Clear text">
    </form>
    
    """ % (str2html(data), escape(str(components)), docsep1, docsep2)


def description():
    return """
    <hr>
    <h2>Description</h2>
    This web service performs topic mining with sentiment analysis;
    see <a href="?format=script">the script</a>
    (<a href="?format=scriptfm">formatted</a>).
    <h3>Procedure</h3>
    <ol>
    <li>Type in texts: one on each line.
    <li>(Set the number of topics if you do not want it determined automatically.)
    <li>Press "Analyze" and wait.
    </ol>
 
    <h3>The results</h3>
    The texts will be grouped into topics.
    The value in parentheses after a word or a text is the "load"
    of the word or the text on the topic.

    <h3>Details</h3>
    <ol>
    <li>The topic mining is performed with <a
    href="http://en.wikipedia.org/wiki/Non-negative_matrix_factorization">non-negative
    matrix factorization</a>. 
    <li>The sentiment analysis is performed with the <a href="http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010">AFINN</a> word list.
        The sentiment of a topic is found by summing the sentiments of the
        individual texts, weighted by the number of texts in the topic.
        The sentiment of each individual text is found by summing the
        sentiment strength of each word, weighted by the number of words.
        The weighting is by the square root.
    <li>A word list excludes common English words ("stopwords").
    <li>The analysis will only work on up to a few hundred short texts.
    <li>The results may change each time you run the algorithm.
        This is due to the random initialization and the local optima of
        the factorization algorithm.
    <li>The value shown after each topic, each word, and each document is
        the load, telling how important they are.
    <li>Texts and words are assigned exclusively to one topic,
        even if some of the texts load partially on two or more
        topics.
    <li>If the load of a text or word is zero then it shows up in
        the "not assigned" category.
    <li>The example texts are taken from the first (few) lines of
        Wikipedia articles about companies and countries.
        If the topic mining works well it should separate the country
        and company texts.
    </ol>

    <h3>References</h3>
    <ol>
    <li><a href="http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3661/pdf/imm3661.pdf">Mining the posterior cingulate: Segregation between memory and pain components</a>. Finn Årup Nielsen, Daniela Balslev, Lars Kai Hansen. NeuroImage, 27(3):520-532, 2005.
    <li><a href="http://ceur-ws.org/Vol-718/paper_16.pdf">A new ANEW:
    Evaluation of a word list for sentiment analysis in
    microblogs</a>, Finn Årup Nielsen,
    Proceedings of the ESWC2011 Workshop on 'Making Sense of
    Microposts': Big things come in small packages, May 2011.  
    </ol>
    </div>
"""




def texts2matrix(texts):
    """
    Convert a list of strings to a bag-of-words matrix.
    A sparse scipy matrix and identified terms are returned.
    A stopword list is applied if available.
    """
    wordlists = [ map(lower, pattern_word.findall(text)) for text in texts ]
    wordsfreq = {}
    for wordlist in wordlists:
        for word in set(wordlist):
            wordsfreq[word] = wordsfreq.get(word, 0) + 1
    # Remove single instance words
    terms = [ word for word in wordsfreq if wordsfreq[word] > 1 ]
    terms = list(set(terms).difference(stopwords()))
    if terms:
        M = sparse.lil_matrix((len(texts), len(terms)))
        for n in range(len(texts)):
            for m in range(len(terms)):
                M[n,m] = wordlists[n].count(terms[m])
    else:
        M = None
    return (M, terms)
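
# Rough sketch of the mapping (not executed by the service): terms occurring
# in only one document are dropped, so for
#     texts2matrix([u"red green red", u"green blue"])
# the only surviving term is u"green" and M is the 2-by-1 count matrix
# [[1.0], [1.0]].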


def wta((W, H)): 
    """
    Winner-take-all function
    """
    return (np.asarray(map(lambda col: map(lambda e: e*int(e==col.max()), col), np.asarray(W))),
            np.asarray(map(lambda row: map(lambda e: e*int(e==row.max()), row), np.asarray(H).T)).T)
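
# For example (a sketch): if a row of W is [3, 1], the winner-take-all step
# keeps [3, 0], so the corresponding document is assigned to topic 0 only.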


def adjustwh((W, H)):
    """
    Change the scaling of W and H, and record them according to scaling
    """
    wsum = np.asarray(W.sum(axis=0)).flatten()
    hsum = np.asarray(H.sum(axis=1)).flatten()
    whsum = wsum * hsum
    whsumsqrt = np.sqrt(whsum)
    I = np.argsort(-whsum)
    Dw = np.mat(np.diag(whsumsqrt[I]/wsum[I]))
    Dh = np.mat(np.diag(whsumsqrt[I]/hsum[I]))
    return (W[:,I]*Dw, Dh*H[I,:])
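
# Note: Dw * Dh is the identity, so adjustwh leaves the product W * H
# unchanged; only the balance between the two factors and the component
# order change.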


def nmf(M, components=None, iterations=50):
    """
    Non-negative matrix factorization
    """
    if not components:
        components = np.ceil(np.sqrt(float(min(M.shape))/2))
    W = np.mat(np.random.rand(M.shape[0], components))
    H = np.mat(np.random.rand(components, M.shape[1]))
    if scipy_version[0] == 0 and scipy_version[1] < 7 and components > 1:
        for n in range(0, iterations): 
            H = np.multiply(H, (W.T * M).todense() / (W.T * W * H + 0.001))
            W = np.multiply(W, (M * H.T).todense() / (W * (H * H.T) + 0.001))
    else:
        for n in range(0, iterations): 
            H = np.multiply(H, (W.T * M) / (W.T * W * H + 0.001))
            W = np.multiply(W, (M * H.T) / (W * (H * H.T) + 0.001))
    return (W, H)
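
# Rough usage sketch (toy example; not executed by the service):
#
#     M = sparse.lil_matrix(np.mat([[1., 0., 1.], [0., 1., 1.]]))
#     W, H = nmf(M, components=2)
#
# The multiplicative updates (in the style of Lee and Seung) keep W and H
# non-negative, and W * H should approximate M after the iterations.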


def text2sentiment(text):
    """
    Text as string as input.
    A float for valence is returned. Positive for positive valence.  
    """
    words = pattern_word.findall(text.lower())
    sentiments = map(lambda word: afinn.get(word, 0), words)
    if sentiments:
        sentiment = float(sum(sentiments))/math.sqrt(len(sentiments) + np.finfo(float).eps)
    else:
        sentiment = 0
    return sentiment
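
# Worked example (assuming "excellent" scores +3 in AFINN-111 and the other
# words are unscored): u"This is utterly excellent" has four words, so the
# valence is 3 / sqrt(4) = 1.5.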



def components2html_overview(W, H, texts, terms):
    """
    Turn the topics into HTML
    """
    weight = np.asarray(W.sum(axis=0)).flatten()
    iw = np.argsort(-W, axis=0)
    ih = np.argsort(-H, axis=1)
    # Iterate over all topics
    print("""<table><tr>""")
    for n in range(W.shape[1]):
        print("""<td><table>""")
        print("""<tr><th><a href="#topic_%d">Topic %d</a>""" % (n+1, n+1))
        if afinn:
            sentiments = []
            for m in range(len(texts)):
                if W[iw[m,n],n] > 0:
                    t = texts[iw[m,n]]
                    sentiments.append(text2sentiment(t))
            sentiment = sum(sentiments)/math.sqrt(len(sentiments) + np.finfo(float).eps)
            color = colors[min(5, max(-5, round(2*sentiment)))]
            print("""<tr><td style="background-color: %s; ">Sentiment: %.2f""" % (color, sentiment))
        for m in range(min(5, len(terms))):
            if H[n,ih[n,m]] > 0:
                print("<tr><td>%s " % (str2html(terms[ih[n,m]]),))
            else:
                break
        print("""</table>""")
    print("""</table>""")

def components2html_details(W, H, texts, terms):
    """
    Turn the text and terms into HTML
    """
    weight = np.asarray(W.sum(axis=0)).flatten()
    iw = np.argsort(-W, axis=0)
    ih = np.argsort(-H, axis=1)
    # Iterate over all topics
    for n in range(W.shape[1]):
        print("""<h3><a name="topic_%d">Topic %d (%.2f)</a></h3>""" % (n+1, n+1, weight[n]))
        for m in range(len(terms)):
            if H[n,ih[n,m]] > 0:
                print("%s (%.2f) " % (str2html(terms[ih[n,m]]), H[n,ih[n,m]]))
            else:
                break
        print("<ol>")
        sentiments = []
        for m in range(len(texts)):
            if W[iw[m,n],n] > 0:
                t = texts[iw[m,n]]
                if afinn:
                    sentiments.append(text2sentiment(t))
                print("<li>%s (%.2f)" % (str2html(t), W[iw[m,n],n]))
        print("</ol>")
        if afinn:
            sentiment = sum(sentiments)/math.sqrt(len(sentiments) + np.finfo(float).eps)
            print("Sentiment: %.2f" % sentiment)

    # Write out documents not assigned to a topic
    print("""<h3><a name="topic_0">Not assigned</a></h3>""")
    for m in range(len(terms)):
        if not any(H[:,m]):
            print("%s" % (str2html(terms[m])))
    print("<ol>")
    sentiments = []
    for m in range(len(texts)):
        if not any(W[m,:]):
            t = texts[m]
            if afinn:
                sentiments.append(text2sentiment(t))
            print("<li>%s" % (str2html(t)))
    print("</ol>")
    if afinn:
        sentiment = sum(sentiments)/math.sqrt(len(sentiments) + np.finfo(float).eps)
        print("Sentiment: %.2f" % sentiment)




# Get CGI field value
form = parseform(FieldStorage())

if form['format'] == 'default':

    print(header())

    if form['docsep'] == "twonewlines":
        lines = split(r"\n\s*\n", form['data'])
    else:
        # form['docsep'] == "newline":
        lines = split(r"\n", form['data'])

    texts = ""
    if lines: 
        texts = lines

    text = "\n".join(texts)
    print(inputform(data=form['data'], components=form['components'], docsep=form['docsep']))

    if texts:
        matrix, terms = texts2matrix(texts)
        print("""
      <hr>
      %d documents<br>
      %d terms: %s""" % (len(lines), len(terms), str2html(" ".join(terms)))) 
        print("""<hr>
      <h2>Result</h2>""")
        if matrix is not None:
            left_matrix, right_matrix = wta(adjustwh(nmf(matrix, components=form['components'])))
            components2html_overview(left_matrix, right_matrix, texts, terms)
        else:
            print("(no results)")

        print("""<h2>Result details</h2>""")
        if matrix is not None:
            components2html_details(left_matrix, right_matrix, texts, terms)
        else:
            print("(no results)")

    if warnings:
        print("""
    <hr>
    <h2>Warnings</h2>""")
        for warning in warnings:
            print("""<div class="warning">%s</div><br>""" % str2html(warning))

    print(description())  
    print(footer())

elif form['format'] == 'script':
  
    print("Content-type: text/plain; charset=utf-8\n")
    filename_script = os.environ.get("SCRIPT_FILENAME", "").strip()
    s = open(filename_script).read()
    print(s)

elif form['format'] == 'scriptfm':
  
    print(header())
    print("<pre>")
    filename_script = os.environ.get("SCRIPT_FILENAME", "").strip()
    s = open(filename_script).read()
    print(str2html(s))
    print("</pre>")
    print(description())  
    print(footer())


