September 7, 2010. Room: G3-29 (10-11) and H3-27 (11-12).
Lecturers: Reinder van der Heide and Maarten Marx
Slides (screen)
Slides (print)
Analyzing and Visualizing Social Networks of Members of Parliament
XPath sandbox. Try e.g.
collection('/db/euparliament/data/EN')//speech[ft:query(.,'Marx')] (all speeches mentioning Marx), or ask “Who mentions Marx?” by
collection('/db/euparliament/data/EN')//speech[ft:query(.,'Marx')]/@speaker
This query selects attribute nodes, which the database engine does not print to the screen. To see the attribute values, ask for their string value, like this:
for $s in collection('/db/euparliament/data/EN')//speech[ft:query(.,'Marx')]/@speaker
return string($s)
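The difference between selecting attribute nodes and asking for their string values can also be tried outside the sandbox. The following is a minimal Python sketch, not the eXist setup itself: the element and attribute names `speech` and `speaker` are taken from the queries above, but the sample document and speaker names are invented, and a case-insensitive substring test stands in for the full-text `ft:query` function (so it would also match words like "Marxism").

```python
import xml.etree.ElementTree as ET

# Invented sample data; only the names "speech" and "speaker" come from the queries above.
SAMPLE = """
<proceedings>
  <speech speaker="Alice Example">Marx wrote about surplus value.</speech>
  <speech speaker="Bob Example">The budget needs more attention.</speech>
  <speech speaker="Carol Example">I disagree with Marx on this point.</speech>
</proceedings>
"""


def speakers_mentioning(xml_text, term):
    """Return the speaker attribute values of all speeches containing term."""
    root = ET.fromstring(xml_text)
    names = []
    for speech in root.iter("speech"):
        # Join all text inside the speech, including text of nested elements.
        text = "".join(speech.itertext())
        # Assumption: plain substring matching instead of a full-text index.
        if term.lower() in text.lower():
            # get() returns the attribute value as a string,
            # which corresponds to string($s) in the query above.
            names.append(speech.get("speaker"))
    return names


print(speakers_mentioning(SAMPLE, "Marx"))  # ['Alice Example', 'Carol Example']
```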
Most XPath tutorials and lessons use data-centric XML files, for instance the queries on the CIA Factbook/Mondial database used in first-year Information Science at the UvA.
We created a set of queries on the XML version of the proceedings of the European Parliament in English.
The queries can all be expressed in XPath 1.0, and the level is introductory. Students can try out their queries in an eXist sandbox and immediately see the results. The set of queries is available as a Google form.
Teachers are welcome to use this form. If you want to receive your students' results, do the following:
You will receive the answers as a CSV file as soon as possible.
Model answers are also available on request.
Ed Summers posted the following message on the W3C EGov public mailing list:
I don’t know if this got discussed on here much yet, but I discovered
today via the Sunlight Foundation blog [1] that the Federal Register
2.0 site was recently released [2]. The Federal Register is one of the
most important government publications in the US, since it is the most
comprehensive publication of all the rules and regulations of the
various agencies that make up US federal government.
The new site is interesting to me for a few reasons:
- it uses opensource technologies (ruby, ruby on rails, mysql, sphinx,
nginx, apache2, varnish)
- the source code for the website itself is opensource, and available
to people to contribute changes/enhancements on github
- there is machine readable data available various flavors of xml
- there are permalinks for each entry in the Federal Register, which
incourages citability
- it is deployed in the cloud on Amazon’s ec2/s3
- it was the result of an egov software contest organized by the
Sunlight Foundation
I wrote up some more of my thoughts in my blog [3], if you care to
comment here or there. If anyone from NARA, GPO or Sunlight Foundation
are reading, nice work!
//Ed
[1] http://sunlightlabs.com/blog/2010/meet-the-new-federal-register/
[2] http://www.federalregister.gov/
[3] http://inkdroid.org/journal/2010/07/27/federal-register-embraces-the-web-and-opensource/
Some missing aspects
This XML collection is potentially a great resource, but at least three things need to be done before the XML can be reused reliably in a mashup:
A fantastic aspect of the site is the ability to link to individual paragraphs in the documents.
Try for example http://www.federalregister.gov/a/2010-18383/p-12. This link is provided in the red ribbon to the right of the paragraph.
Mashups could potentially benefit from this feature. But unfortunately, these links are not present in the XML.
Conclusion
If you want to add this data to the Linked Open Data cloud, or if you want to create a mashup based on this data set, you have to screen-scrape the HTML page that comes with each XML document.
This is a pity, because scraping amounts to reverse engineering the page, which is neither a reliable nor a stable solution.
Maarten Marx will give a keynote speech at the 2010 edition of ESAIR, the workshop on Exploiting Semantic Annotations for Information Retrieval, held during CIKM 2010.
Title: The Surplus Value of Semantic Annotations.
Abstract
We compare the costs of semantic annotation of textual documents to its benefits for information processing tasks. Semantic annotation can improve the performance of retrieval tasks and facilitates an improved search experience through faceted search, focused retrieval, better document summaries, and result grouping.
Applications which summarize large collections of text or explain real world phenomena based on textual evidence may receive even more benefit from semantic annotations.
Semantic annotation creates surplus value if the annotated data can be used beyond any foreseen application, in particular by third parties who use your semantic markup to link your data to other data with similar markup.
We present a list of properties of the annotated data which optimize this surplus value. They are derived from the principle which states that annotation should facilitate the reuse of data in a mashup without information being lost or distorted.
For the Dutch House of Parliament we annotated the parliamentary proceedings based on this principle. Concrete examples from this data collection will illustrate the surplus value enhancing properties.
This page reports the latest developments concerning the remaining Blijkmeer funds.
Arthur Suermondt has written a bachelor thesis on intellectual property rights on government information. He addresses questions such as “May the report of the Davids Committee be included in a database for scientific use?”.
The thesis is available here: A. Suermondt, Intellectual Property Rights on Public Sector Information, Bachelor thesis, University of Amsterdam, 2010.
PoliticalMashup has evaluated the script that links names of politicians to their biography page at parlement.com. The evaluation was done on the well-formed XML versions of the parliamentary proceedings (Handelingen) as available from overheid.nl.
These figures were obtained with the strict setting of the script. A link from a name X is only made if
The script is therefore precision-oriented: if there is no exact match, no link is made. We can thus assume that, provided the database is correct, the links that are made are correct.
To get around typos and spelling variants we have to make a “smart” guess at who is meant. An algorithm that simply picks the name with the smallest Levenshtein distance quickly makes silly mistakes. It would link Bommel to Tommel and Bommer, both at distance 1. In this case, however, a very common error was made: the infix “van” was dropped. The correct link was therefore to Harry van Bommel.
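The Bommel example can be reproduced with a few lines of Python. This is a minimal sketch, not the linking script itself: the candidate list is limited to the three surnames mentioned above, and stripping infixes before comparing is only one possible heuristic, shown here as an assumption about how the dropped "van" could be handled. The helper names `strip_infixes` and `closest` and the `INFIXES` list are made up for this illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


# Hypothetical list of common Dutch infixes, for illustration only.
INFIXES = {"van", "de", "der", "den", "te", "ter", "ten"}


def strip_infixes(name):
    """Drop infix words so 'Van Bommel' and 'Bommel' compare as equal."""
    return " ".join(w for w in name.split() if w.lower() not in INFIXES)


def closest(name, candidates, normalise=lambda s: s):
    """Return the smallest distance and all candidates at that distance."""
    target = normalise(name)
    scored = [(levenshtein(target, normalise(c)), c) for c in candidates]
    best = min(d for d, _ in scored)
    return best, [c for d, c in scored if d == best]


candidates = ["Tommel", "Bommer", "Van Bommel"]
print(closest("Bommel", candidates))                 # (1, ['Tommel', 'Bommer'])
print(closest("Bommel", candidates, strip_infixes))  # (0, ['Van Bommel'])
```

The first call shows the naive mistake: two wrong names tie at distance 1 while the intended name sits at distance 4. With infixes stripped on both sides, the intended name becomes an exact match.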
| | TK | EK | Formula |
|---|---|---|---|
| Number of name occurrences | 501896 | 66760 | `grep 'MPid=""' *tk* \| wc` |
| Number of occurrences not linked | 8500 | 9802 | `grep 'MPid=""' *tk* \| wc` |
| Number of unique occurrences | 54531 | 9394 | `grep -o 'speaker="[^"]*' *tk* \| sort \| uniq -c \| sort -nr \| wc` |
| Number of unique occurrences not linked | 166 | 120 | `grep 'MPid=""' *tk* \| grep -o 'speaker="[^"]*' \| sort \| uniq -c \| sort -nr \| wc` |
bash-3.2$ grep 'MPid=""' *tk* |grep -o 'speaker="[^"]*' |sort|uniq -c |sort -nr|head -20
1347 speaker="Dibi
1058 speaker="Bot
961 speaker="Van Bijsterveldt-Vliegenthart
696 speaker="J.M. de Vries
560 speaker="Kamp
537 speaker="Van Middelkoop
418 speaker="Meijer
381 speaker="G.M. de Vries
348 speaker="Nuis
307 speaker="Leerdam
282 speaker="Hendriks
201 speaker="Verstand
172 speaker="B.M. de Vries
161 speaker="Jules Kortenhorst
87 speaker="Essers
40 speaker="Donner
35 speaker="Van den Berg
35 speaker="Kok
35 speaker="Blokland
33 speaker="De Mos
bash-3.2$
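For readers who prefer a script to a shell pipeline, the counting step above can be sketched in Python. This is an illustration under assumptions, not the script used for the evaluation: the sample lines, the element name, the attribute order and the identifier value `nl.m.01234` are all invented. Only the two patterns, `MPid=""` and `speaker="[^"]*`, are taken from the grep commands.

```python
import re
from collections import Counter

# Invented sample lines; real input would be the lines of the *tk* files.
SAMPLE_LINES = [
    '<speech speaker="Dibi" MPid="">',
    '<speech speaker="Bot" MPid="">',
    '<speech speaker="Dibi" MPid="">',
    '<speech speaker="Van Bommel" MPid="nl.m.01234">',  # made-up identifier
]

# Same pattern as in the grep -o call, with the name captured in a group.
SPEAKER = re.compile(r'speaker="([^"]*)')


def unlinked_speaker_counts(lines):
    """Count speaker names on lines whose MPid attribute is empty."""
    counts = Counter()
    for line in lines:
        if 'MPid=""' in line:                        # grep 'MPid=""'
            counts.update(SPEAKER.findall(line))     # grep -o 'speaker="[^"]*'
    return counts


counts = unlinked_speaker_counts(SAMPLE_LINES)
print(counts.most_common(20))   # sort | uniq -c | sort -nr | head -20
print(sum(counts.values()))     # number of unlinked occurrences
print(len(counts))              # number of unique unlinked names
```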
The XML data of the parliamentary questions (kamervragen) appears to contain considerably more errors than the proceedings. We see a great many typos and spelling variations in names.
There are 109,165 persons (submitters and respondents) in a corpus of 54,273 parliamentary questions.
Of these, 5037 (about 5%) have NOT been given a link.
There are 1084 unique names, of which 370 have not been given a link. These are very often typos and spelling variants. See for example:
bash-3.2$ grep -o "
1 name="Aadsted-Madsen
1 name="A.A.M. Willemse-van der Ploeg
7 name="Aartsen
20 name="Aasted Madsen
42 name="Aasted-Madsen
2 name="Aboutaleb
1 name="Adelmund
2 name="Adelmund.
9 name="Albayrak
5 name="Apostolou
7 name="Azough
1 name="A. J. te Veldhuis
1 name="Baalen
6 name="Baarda
1 name="Ballin
5 name="Barendregt
4 name="Bemelmans-Videc
1 name="Benschop
7 name="Bibi de Vries
25 name="Bierman
2 name="Biermans
4 name="Bijleveld-Schouten
2 name="Bijsterveldt-Vliegenthart
23 name="Blanksma
75 name="B. M. de Vries
25 name="B.M. de Vries
4 name="Bomhoff
2 name="Boogaard
17 name="Boorsma
3 name="Borst
15 name="Bos
1 name="Bosma en Wilders
bash-3.2$ grep -o "
1084 2997 29639
bash-3.2$ grep -o "
370 1090 10005
bash-3.2$ grep -o "
731 name="Bot
633 name="De Vries
372 name="M. B. Vos
304 name="Jasper van Dijk
174 name="Van Bijsterveldt-Vliegenthart
163 name="Dibi
120 name="M. B. Vos
117 name="Leerdam
107 name="Nuis
75 name="B. M. de Vries
bash-3.2$ grep -o "
5037 20044 302166
bash-3.2$ grep -o "
109165 372238 6903180
bash-3.2$ pwd
/scratch/data/parliament/nl/inprogress/overheid.nl/KVR/PM-transformed
bash-3.2$ ls |wc
54273 54273 1155050
bash-3.2$
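Several of the variants in the listings above differ only in spacing, hyphenation or trailing punctuation, for example "B. M. de Vries" and "B.M. de Vries", or "Adelmund" and "Adelmund.". A minimal Python sketch of how such variants could be grouped before linking is given below. The normalisation rule and the function names `name_key` and `group_variants` are assumptions made for this illustration; the names and counts are copied from the first listing. Genuine typos such as "Aadsted-Madsen" are not caught by this rule and would still need fuzzy matching.

```python
import re
from collections import defaultdict

# Names and counts copied from the first listing above.
NAME_COUNTS = {
    "Aadsted-Madsen": 1,
    "Aasted Madsen": 20,
    "Aasted-Madsen": 42,
    "Adelmund": 1,
    "Adelmund.": 2,
    "B. M. de Vries": 75,
    "B.M. de Vries": 25,
    "Bos": 15,
}


def name_key(name):
    """Hypothetical normalisation: ignore case, dots, hyphens and whitespace."""
    key = name.lower()
    key = re.sub(r"[.\-]", " ", key)   # treat dots and hyphens as spaces
    return re.sub(r"\s+", "", key)     # then drop all whitespace


def group_variants(counts):
    """Group spellings sharing a key; keep only groups with several spellings."""
    groups = defaultdict(dict)
    for name, n in counts.items():
        groups[name_key(name)][name] = n
    return {k: v for k, v in groups.items() if len(v) > 1}


for key, variants in group_variants(NAME_COUNTS).items():
    print(key, variants, sum(variants.values()))
```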