Making WordClouds in XPath/XQuery

Posted on 12-03-2012 by Maarten Marx | education, XPath

We describe step by step how to make a basic wordcloud in XPath/XQuery.
In the tutorial we use two files:
an input XML file and a wordcloud XQuery file.


step 1
Collect data.
Pitfall: HTML data from the web is usually not well-formed XML, so you cannot process it with XPath.
Solution: Clean it up using tidy.
bash-3.2$ curl http://politicalmashup.nl/2012/02/politicalmashup-and-politix/| tidy -asxml - > file.xml
step 2
Explore the contents using XPath. This can nicely be done in oXygen.
Pitfall: The processor cannot fetch the DTD, and your expressions return no results because of the namespace.
Solution: Just remove both the DOCTYPE declaration and the namespace declaration.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

simply becomes
<html>
Some useful expressions:
//p all paragraphs
//a all anchors (links)
//h1 | //h2 | //h3 |//h4 all headings

step 3
Only take what you want.
Suppose we want only the text in paragraphs, and only the main text of the page.
We could do that by restricting ourselves to paragraphs of a certain length:

count(//p[string-length(.) gt 100])

This counts only the paragraphs that are longer than 100 characters; drop the count() to get the paragraphs themselves.
step 4
Extract the words:
tokenize(//p[string-length(.) gt 100],'\W+')
This gives a runtime error:

XPath failed due to: A sequence of more than one item is not allowed as the first argument of tokenize() ("Ook benieuwd naar wat de parti...", "De grondige analyse van de dat...")

Problem: The tokenize() function takes a single string as input, but we have provided a sequence of strings.
Solution: Glue them together:
string-join(//p,' ')
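
To see the fix in isolation, here is a minimal sketch that uses two made-up strings in place of the //p nodes (the sample sentences are an assumption, not text from the page):

```xquery
(: Two sample strings stand in for the paragraphs selected by //p. :)
let $paragraphs := ('The first paragraph.', 'A second one')
(: string-join() turns the sequence into ONE string, which tokenize() accepts. :)
return tokenize(string-join($paragraphs, ' '), '\W+')
```

This should return the six words The, first, paragraph, A, second, one. The pattern '\W+' treats every run of non-word characters (here ". " and the single spaces) as one separator.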

step 5
What have we got? Count the tokens (all words) and the types (distinct words):

count(tokenize(string-join(//p[string-length(.) gt 100],' '),'\W+') )
count(distinct-values(tokenize(string-join(//p[string-length(.) gt 100],' '),'\W+') ))

This gets hard to read. Luckily, you may also spread an expression over several lines:

count(
  distinct-values(
    tokenize(
      string-join(//p[string-length(.) gt 100], ' '),
      '\W+')
  )
)
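
Two things can distort these counts: tokenize() returns an empty string when the text starts or ends with a non-word character, and "De" and "de" count as different types. The sketch below shows one possible way to filter and normalise; the sample sentence is made up, and in the tutorial it would be the joined paragraphs.

```xquery
(: Sample text is an assumption; it ends in '.', so tokenize() yields a trailing ''. :)
let $text := 'De partij en de analyse.'
let $raw := tokenize($text, '\W+')
(: Drop empty strings and fold case so that 'De' and 'de' become one type. :)
let $tokens := for $t in $raw[. ne ''] return lower-case($t)
return (count($raw), count($tokens), count(distinct-values($tokens)))
```

This should return 6, 5 and 4: six raw tokens including the empty one, five real tokens, and four types. Whether you want case folding depends on the text, so treat it as an option rather than a requirement.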
step 6
Code refactoring.
We switch to creating an XQuery file because of the following:

  1. code can be made more readable using let clauses
  2. we want to order the words by their frequency

However, this creates trouble in oXygen with the namespace.
Here is the solution: declare the namespace (use the same URL as in the XML file), and use its prefix whenever you refer to an element.

declare namespace h = 'http://www.w3.org/1999/xhtml';
let $tokens := tokenize(string-join(//h:p[string-length(.) gt 100],' '),'\W+')
let $words := distinct-values($tokens)
return
for $w in $words return
    concat($w,': ',count($tokens[. eq $w]),'&#10;')


step 7
Finish everything. You can now output HTML with style instructions to create a really good-looking word cloud.

declare namespace h = 'http://www.w3.org/1999/xhtml';
let $tokens := tokenize(string-join(//h:p[string-length(.) gt 100],' '),'\W+')
let $words := distinct-values($tokens)
return
<html><body>{ (: note the curly braces: leave them out and you just output the XQuery, not the answer :)
  for $w in $words
  let $wc := count($tokens[. eq $w])
  order by $wc descending
  return <span style='{concat("font-size: ",$wc,"pt;")}'>{$w}</span>
  (: again curly braces: now also inside the values of attributes :)
}</body></html>


step 8
Create better values for the font sizes. I leave this up to you. Note that algorithms which give good results use a logarithmic function, which XPath 2.0 does not have.
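
As a starting point, here is a hedged sketch of logarithmic scaling. It assumes a processor that supports XQuery 3.0, where math:log is available in the standard math namespace. The function name local:font-size and the 10pt and 40pt bounds are hypothetical choices for illustration only.

```xquery
xquery version "3.0";
declare namespace math = "http://www.w3.org/2005/xpath-functions/math";

(: Hypothetical helper: maps a word count onto a font size between 10pt and 40pt.
   The bounds are arbitrary assumptions; $max is the count of the most frequent word. :)
declare function local:font-size($wc as xs:integer, $max as xs:integer) as xs:integer {
  if ($max le 1) then 10
  else xs:integer(round(10 + 30 * math:log($wc) div math:log($max)))
};

(: Example: sizes for words occurring 1, 5 and 25 times when the top word occurs 25 times. :)
for $wc in (1, 5, 25) return local:font-size($wc, 25)
```

With these inputs the sizes should come out as 10, 25 and 40. In the word cloud query you would compute $max once, for instance as the largest of the per-word counts, and call the helper inside the style attribute instead of using $wc directly.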
oXygen tips

  1. Open both your input file and your xquery file in oXygen
  2. Use the XPath "search box" on your input file to test small XPath expressions. Note that you do not need to worry about the namespace there.
  3. Put code together in your XQuery file.
  4. You can run an XQuery file in two ways in oXygen:
    1. Press the "play button" and choose "XQuery transformation with Saxon" as your scenario. This requires that you specify the input file inside the XQuery and use it in the XPath expressions that refer to that file. This can be done as follows:

      declare namespace h = 'http://www.w3.org/1999/xhtml';

      let $input := doc('file.xml')
      let $tokens := tokenize(string-join($input//h:p[string-length(.) gt 100],' '),'\W+')

    2. Use the XQuery transformation button (at the top right in oXygen). Press it, choose your input file and your XQuery file, and press play. The XQuery given earlier now works as is.
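
For the first way of running the query, the pieces from the steps above could be put together roughly as follows. This is a sketch under assumptions: file.xml sits next to the query and still carries the XHTML namespace, and the html/body wrapper and the space between spans are illustrative choices.

```xquery
declare namespace h = 'http://www.w3.org/1999/xhtml';

(: Assumes file.xml can be resolved relative to this query. :)
let $input := doc('file.xml')
let $tokens := tokenize(string-join($input//h:p[string-length(.) gt 100], ' '), '\W+')
return
<html><body>{
  for $w in distinct-values($tokens)
  let $wc := count($tokens[. eq $w])
  order by $wc descending
  (: the trailing ' ' keeps adjacent words apart when the page is rendered :)
  return (<span style="font-size: {$wc}pt;">{$w}</span>, ' ')
}</body></html>
```

If the query returns an empty body, the usual suspect is the namespace: the URI in the declaration must match the one on the html element of the input file exactly.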
