Framing Questions on PoliDocs data

Authors: Loredana Afanasiev and Maarten Marx

Request from Rens Vliegenthart:

For a paper I am currently working on, it would already help me enormously to
have, for every frame, an overview list (for 1990-2008) with:
name of the member of parliament / party / type of document (proceedings / parliamentary question / motion) / date.
Is it a lot of work to make such a list, or can I somehow do that myself
in PoliDocs?

Furthermore, these are the search terms I used for the different
frames:
Multicultural frame
(multiculture*) AND (diversiteit or respect or verschil* or particip* or
dialoog or gesprek)
Emancipation
(allochto* or vreemdeling* or immigrant* or asielzoeker* or minderheden) AND
emancip*
AND (integr* or particip*)
Restriction
(importbruid* or nieuwkomer* or instroombeperking* or ((voorwaard* or eis*)
w/5
immigratie) or (wetgeving w/5 immigratie) AND inburgering)
Victimization
(hoofddoek* or ongelijkheid or eerwraak) AND ((allochto* or vreemdeling* or
immigrant*
or asielzoeker* or minderheden))
Anti-Islam
(islam* AND (bedreiging* or terrorisme))

The task

The task is to express the frames defined in the request above as
XQuery queries over the collections of XML documents containing the
Dutch parliamentary data served by http://www.polidocs.nl, and to compute statistics about their results.

More concretely, for every frame that the request above operationalizes
as a boolean keyword query, we retrieve the following information:

  1. the collection name
  2. the searchable content (the content on which the boolean keyword query is executed)
  3. a permanent url to the searchable content
  4. the (name of the) member of parliament that contributed the searchable content
  5. the party of the member of the parliament
  6. the date on which the searchable content was produced

Polidocs data

Polidocs data consists of three collections of XML documents: HAN, KVR
and MOT. The files composing each collection are available at:

Note that I am working with a copy of the Polidocs data. My copy of HAN
was made on 2008-09-17, that of MOT on 2008-11-13, and that of KVR on
2008-12-22. The data currently served on the Polidocs site might contain more
information.

Issues and design choices

To implement this task we have to make several design choices, described in detail below.

  1. For every collection we have to choose the structural
    elements that correspond to the information need. Below, we describe our
    choices by giving the XPath queries that retrieve those elements.
  2. We have to choose the model used to evaluate the
    boolean keyword queries that define the frames. The choice is
    between the substring and regular-expression matching built into
    XPath/XQuery and the IR keyword search models implemented by MonetDB/XQuery/PFTijah,
    the DBMS we use to manage our data. Below, we give the
    definition of every frame both in XPath and in the PFTijah formalism.

Matching the data to information need

– Handelingen

  1. collection name: HAN
  2. searchable content: for every document $d in the collection and every element $spreker returned by $d//spreker.
  3. polidocslink: concat("http://www.polidocs.nl/XML/HAN/",$d//metadata/item[@attribuut="Document-id"]/text(),".xml#",$spreker/@anker)
  4. person name: $spreker/@naam
  5. party name: $spreker/@partij
  6. date: $d//metadata/item[@attribuut="Datum_vergadering"]/text()

– Moties

  1. collection name: MOT
  2. searchable content: for every document $d in the collection the element obtained with $d//tekstxml.
  3. polidocslink: concat("http://www.polidocs.nl/XML/MOT/",$d/document/@id,".xml")
  4. person name: $d//indienergnlod/item/naam/text()
  5. party name: $d//indienergnlod/item/partij/text()
  6. date: $d/document/datum/text()

– Tweede Kamer Vragen

Issues:

  1. What to take as searchable content: the questions, the answers, or
    both? For a start, we take both the questions and the answers.
  2. A question can be spread over more than one document. The unique key of a question is stored in the //metadata/item[@attribuut="Vraagnr_bij_indiening"] element. For example, the question with key 2020310270 is spread over two different documents. We ignore this problem for now and treat every document as a separate unit.
  3. More than one person can be responsible for the searchable content. We retrieve all of them.

  1. collection name: KVR
  2. searchable content: for every document $d in the collection, the string values of $d//vragen and $d//antwoorden, joined into one string (as in the query below).
  3. polidocslink: concat("http://www.polidocs.nl/XML/KVR/",$d//metadata/item[@attribuut="Document-id"]/text(),".xml")
  4. person name: $d//vraagdata/vrager/text()
  5. party name: $d//vraagdata/vrager/@partij
  6. date: $d//vraagdata/@indiendatum
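
The second issue above, a question spread over several documents, can at least be measured. The following is a minimal sketch in plain XQuery 1.0, not part of the original queries, that lists every Vraagnr_bij_indiening key occurring in more than one KVR document. The output element names spread and question are made up for illustration, and the query has not been run against the Polidocs data.

```xquery
(: Sketch: list KVR question keys that occur in more than one document.
   <spread> and <question> are illustrative names, not Polidocs elements. :)
<spread>
{
let $keys := collection("KVR")//metadata/item[@attribuut = "Vraagnr_bij_indiening"]
for $k in distinct-values($keys)
(: all documents that carry this key :)
let $docs := collection("KVR")[.//metadata/item[@attribuut = "Vraagnr_bij_indiening"] = $k]
where count($docs) gt 1
order by $k
return <question key="{$k}" documents="{count($docs)}"/>
}
</spread>
```

The scan is quadratic in the number of documents, which should be acceptable for a one-off check. The reported keys could then be merged by hand or in a later query.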

The keyword search queries

Consider the Anti-Islam frame defined in the request. Below I
illustrate the different ways of evaluating its keyword query in MonetDB/XQuery/PFTijah.

* Using the contains() function provided by XPath/XQuery. The context item . denotes one unit of searchable content.

contains(string(.), "islam") and
  ( contains(string(.), "bedreiging") or
    contains(string(.), "terrorisme") )

Note that contains(str1, "islam") tests for substring matching: contains("abaislam", "islam") is true, and contains("islamaba", "islam")
is also true. The comparison is also case-sensitive, so contains("Islam", "islam") is false. This does not correspond to the definition of the
Anti-Islam frame, where islam* should only match words that start with islam, but (due to various temporary technical problems, a.k.a.
bugs) this is the closest approximation I can execute on MonetDB at the moment.

And the XQuery query defining the frame is:

(: Frame Anti-Islam
    (islam* AND (bedreiging* or terrorisme))

   Model: expressed via the contains() function of XQuery
:)

<frame name="anti-islam">
{

(: Collection HAN :)
<collection name="HAN">
{
for $d in collection('HAN')
  for $spreker in $d//spreker[contains(string(.), "islam") and
                              ( contains(string(.), "bedreiging") or
                                contains(string(.), "terrorisme")) ]
  return
    <result>
    {
     <date>{$d//metadata/item[@attribuut="Datum_vergadering"]/text()}</date>,
     <politicus name="{data($spreker/@naam)}"
                party="{data($spreker/@partij)}"/>,
     <polidocslink>{
       concat("http://www.polidocs.nl/XML/HAN/",$d//metadata/item[@attribuut="Document-id"]/text(),".xml#",$spreker/@anker)
     }</polidocslink>,
     <content>{string($spreker)}</content>
    }
    </result>
}
</collection>,

(: Collection MOT:)
<collection name="MOT">
{
for $d in collection("MOT")
  for $text in $d//tekstxml[contains(string(.), "islam") and
                            ( contains(string(.), "bedreiging") or
                              contains(string(.), "terrorisme")) ]
  return
    <result>
    {
     <date>{data($d/document/datum/text())}</date>,
     <politicus name="{string($d//indienergnlod/item/naam/text())}"
                party="{string($d//indienergnlod/item/partij/text())}"/>,
     <polidocslink>{
       concat("http://www.polidocs.nl/XML/MOT/",$d/document/@id,".xml")
     }</polidocslink>,
     <content>{string($text)}</content>
    }
    </result>
}
</collection>,

(: Collection KVR:)
<collection name="KVR">
{
for $d in collection("KVR")
  for $content in string-join((string($d//vragen),string($d//antwoorden))," ")
                           [contains(string(.), "islam") and
                            ( contains(string(.), "bedreiging") or
                              contains(string(.), "terrorisme")) ]
  return
    <result>
    {
     <date>{data($d//vraagdata/@indiendatum)}</date>,
     for $vrager in $d//vraagdata/vrager return
     <politicus name="{string($vrager)}"
                party="{string($vrager/@partij)}"/>,
     <polidocslink>{
       concat("http://www.polidocs.nl/XML/KVR/",string($d//metadata/item[@attribuut="Document-id"]),".xml")
     }</polidocslink>,
     <content>{$content}</content>
    }
    </result>
}
</collection>

}
</frame>

* Using the matches() function provided by XPath/XQuery. The context item . denotes one unit of searchable content.

matches(string(.), "(^|\s)islam") and
  ( matches(string(.), "(^|\s)bedreiging") or
    matches(string(.), "(^|\s)terrorisme(\s|$)") )

This currently does not work on my installation of MonetDB because of the following error:

!ERROR: pcre_match() not available as required version of libpcre was
not found by configure.
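
Since matches() is unavailable here, word boundaries can also be approximated without regular expressions. The sketch below, in plain XQuery 1.0, lower-cases the text, turns a few punctuation characters into spaces, pads the result with spaces, and then uses contains() on space-delimited words. The helper functions local:norm, local:has-prefix and local:has-word and the three test sentences are hypothetical, and whether lower-case(), translate() and normalize-space() run on a given MonetDB installation is an assumption I have not checked.

```xquery
(: Hypothetical helpers, not part of the original frame queries. :)

(: Lower-case $s, map the 8 punctuation characters .,;:!?() to 8 spaces,
   collapse whitespace, and pad with one space on each side. :)
declare function local:norm($s as xs:string) as xs:string {
  concat(" ",
         normalize-space(translate(lower-case($s), ".,;:!?()", "        ")),
         " ")
};

(: True if some word in $s starts with $p (the keyword form p*). :)
declare function local:has-prefix($s as xs:string, $p as xs:string) as xs:boolean {
  contains(local:norm($s), concat(" ", $p))
};

(: True if $w occurs in $s as a whole word. :)
declare function local:has-word($s as xs:string, $w as xs:string) as xs:boolean {
  contains(local:norm($s), concat(" ", $w, " "))
};

(: Anti-Islam frame: islam* AND (bedreiging* or terrorisme) :)
for $t in ("De islamitische scholen zijn geen bedreiging.",
           "Een abaislam bedreiging",
           "Islam en terrorisme!")
return
  <test text="{$t}">{
    local:has-prefix($t, "islam") and
    ( local:has-prefix($t, "bedreiging") or
      local:has-word($t, "terrorisme") )
  }</test>
```

The first and third sentences should satisfy the frame, while the second should not, because abaislam does not start with islam. In a frame query, the same test would go into the predicate, with string(.) as the first argument.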

* Using the about() function of NEXI.

let $opt := <TijahOptions ft-index="polietiekedata" ir-model="LM" />
let $coll := collection('HAN')
let $nexi := ".[about(.,islam bedreiging) or about(.,islam terrorisme)]"
for $res in tijah:query($coll,$nexi,$opt)
return $res

This currently does not work on MonetDB because of a bug that is being fixed.

Frame results

(All the queries and results presented below are stored in the following archive: http://staff.science.uva.nl/~marx/pub/PoliticalFrames/FramesOut.zip)

I wrote 4 XQuery queries that define the following 4 frames (I
skipped the Restriction frame, since I do not understand its semantics):

  1. Anti-Islam
  2. Emancipation
  3. Multicultural
  4. Victimization

A fifth frame, Europe, was added by Verkiezingskijker (for the EU elections). Its results are at http://staff.science.uva.nl/~marx/pub/PoliticalFrames/FramesEU.zip

The output of each frame query is an XML document that conforms to the following DTD:

<!ELEMENT frame (collection*)>
<!ATTLIST frame name CDATA #REQUIRED>
<!ELEMENT collection (result*)>
<!ATTLIST collection name CDATA #REQUIRED>
<!ELEMENT result (date,politicus+,polidocslink,content)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT politicus EMPTY>
<!ATTLIST politicus name CDATA #REQUIRED party CDATA #REQUIRED>
<!ELEMENT polidocslink (#PCDATA)>
<!ELEMENT content (#PCDATA)>

I gather all the query results in a directory called Frames for future aggregate computations.
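
Before aggregating, it may be worth checking that the gathered results follow the DTD. A minimal sketch, assuming that collection('Frames') resolves to the Frames directory and using the made-up output elements check and bad: it reports every result that does not have exactly one date, at least one politicus, exactly one polidocslink and exactly one content child. It checks only the child counts, not their order.

```xquery
(: Sketch: report results whose children do not match the DTD content model
   (date, politicus+, polidocslink, content). Element order is not checked.
   <check> and <bad> are illustrative names. :)
<check>
{
for $f in collection('Frames')//frame,
    $c in $f/collection,
    $r in $c/result
where not( count($r/date) eq 1
           and count($r/politicus) ge 1
           and count($r/polidocslink) eq 1
           and count($r/content) eq 1 )
return
  <bad frame="{$f/@name}" collection="{$c/@name}">{
    string($r/polidocslink[1])
  }</bad>
}
</check>
```

An empty check element would mean that every result has the expected children.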

Frame statistics (aggregates)

(All the queries and results presented below are stored in the following archive: http://staff.science.uva.nl/~marx/pub/PoliticalFrames/FramesOut.zip)

Note that in all the queries below, collection('Frames') evaluates to the collection of XML documents stored in the directory Frames constructed before.

Agg1 (Frame hits per collection)

(:
 Per frame and per collection, return the number of hits
:)

<res>
{
for $f in collection('Frames')//frame
return
<frame>{
  $f/@name,
  for $c in $f//collection
  return
    <collection>
    {
      $c/@name,
      count($c/result)
    }
    </collection>
}
</frame>
}
</res>

Agg2 (Frame hits per year)

(:
 Per frame and per year, return the number of hits
:)

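(: Extract a 4-character year from a date string: take the part before or
   after the first '-' or '.', and keep it only if it is exactly 4
   characters long and contains no further separator of the same kind.
   For example, "2008-09-17" and "09-2008" both yield "2008". :)
declare function local:get-year($d as element()) as xs:string* {
  substring-before(string($d),'-')[string-length(.)=4 and not(contains(.,'-'))],
  substring-before(string($d),'.')[string-length(.)=4 and not(contains(.,'.'))],
  substring-after(string($d),'-')[string-length(.)=4 and not(contains(.,'-'))],
  substring-after(string($d),'.')[string-length(.)=4 and not(contains(.,'.'))]
};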

<res>
{
for $f in collection('Frames')//frame
let $years :=
  for $d in $f//date return local:get-year($d)
return
<frame>
{
  $f/@name,
  for $y in distinct-values($years)
  order by $y
  return
  <year name="{$y}">{
    count($years[. eq $y])
  }
  </year>
}
</frame>
}
</res>

Agg3 (Frame hits per party)

(:
 Per frame and per party, return the number of hits
:)

<res>
{
for $f in collection('Frames')//frame
let $parties := (for $r in $f//result return $r/politicus/@party)
return
<frame>
{
  $f/@name,
  for $p in distinct-values($parties) return
  <party name="{$p}">{
    count($parties[. eq $p])
  }
  </party>
}
</frame>
}
</res>
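
The overview list asked for in the request (name, party, type of document, date) can be derived from the same Frames collection. The following sketch is not one of the original queries. It emits one tab-separated line per frame hit and per politician, with the collection name (HAN, KVR or MOT) standing in for the type of document. The column order is my own choice.

```xquery
(: Sketch: one line per frame hit and politician, columns separated by tabs:
   frame, name, party, collection, date, link. :)
string-join(
  for $f in collection('Frames')//frame,
      $c in $f/collection,
      $r in $c/result,
      $p in $r/politicus
  return
    string-join(
      ( string($f/@name),
        string($p/@name),
        string($p/@party),
        string($c/@name),
        string($r/date),
        string($r/polidocslink) ),
      "&#9;"),      (: tab between columns :)
  "&#10;")          (: newline between rows :)
```

The result is a single string, so it is best serialized as text rather than as XML. The dates are copied as they appear in the results and are not normalized.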

Plotting the aggregate values with Google Charts

For each aggregate query I provide another XQuery query that takes the
aggregate results and generates a Google Chart URL, which in turn renders the
plot for that aggregate.

Issue: in XQuery I have to escape the ampersand
symbol ‘&’ as ‘&amp;’. Thus, I have to do the reverse operation on the
resulting URL string. Below are the commands I use to generate
the URL.

bash$ runsaxonb plotagg1.xq | perl -ne 's/&amp;/&/g;print'

where runsaxonb is the following wrapper command:

bash$ java -Xmx1024m -cp /path/to/saxon/saxon9.jar net.sf.saxon.Query -t $1
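
For illustration, a plotting query for Agg1 could look like the sketch below. It is not the plotagg1.xq used above: the input file name agg1.xml, the choice of a pie chart and the chart size are assumptions, and the URL parameters (cht, chs, chtt, chd, chl) are those of the legacy Google Image Charts API. It also shows the ampersand issue: & is written as &amp; inside the XQuery string literals, and it comes out as &amp; again when the result is serialized as XML.

```xquery
(: Sketch: one pie-chart URL per frame, built from the Agg1 output.
   Assumes the Agg1 result was saved as agg1.xml (hypothetical file name). :)
for $f in doc("agg1.xml")/res/frame
let $counts := for $c in $f/collection return xs:integer($c)
let $total  := sum($counts)
where $total gt 0
return
  concat(
    "http://chart.apis.google.com/chart?cht=p",   (: pie chart :)
    "&amp;chs=400x200",                           (: size in pixels :)
    "&amp;chtt=", encode-for-uri(string($f/@name)),
    (: text encoding expects values between 0 and 100, so use percentages :)
    "&amp;chd=t:",
    string-join(for $n in $counts
                return string(round(100 * $n div $total)), ","),
    (: one label per collection, separated by | :)
    "&amp;chl=",
    string-join(for $c in $f/collection
                return encode-for-uri(string($c/@name)), "|"))
```

Frames without any hits are skipped to avoid a division by zero.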

References

@article{RVpolidocs:07,
   Author = {Conny Roggeband and Rens Vliegenthart},
   Date-Added = {2009-04-01 15:30:25 +0200},
   Date-Modified = {2009-04-01 15:32:07 +0200},
   Journal = {{West European Politics}},
   Keywords = {polidocs},
   Month = {May},
   Number = {3},
   Pages = {524--548 },
   Title = {{Divergent framing: The public debate on migration in the Dutch parliament and media, 1995-2004}},
   Volume = {30},
   Year = {2007}}
Last modified on 01-04-2009 by Maarten Marx
