Within the Clarin project War In Parliament, named entities play an important role. In the Handelingen der Staten Generaal (the proceedings of the Dutch parliament) we determine for every word who spoke it. Using Named Entity Recognition techniques, we then determine which entities are being talked about in the spoken text.
Once the entities have been recognised, we try to normalise them by linking them to Wikipedia pages.
We can then answer questions such as:
Who speaks about whom?
Who talks most about location X?
Which organisations are discussed most in the Kamer? Break this down by party.
Which Kamerlid (member of parliament) speaks most about his or her place of residence or place of birth?
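As a rough illustration of how such questions reduce to counting once every entity mention is attributed to a speaker, here is a minimal Python sketch. The record layout `(speaker, party, entity, entity_type)`, the function names and all the data are assumptions made up for this example; they are not the project's actual data model.

```python
from collections import Counter

# Hypothetical mention records: (speaker, party, entity, entity_type).
# In a real pipeline these would come from the NER and Wikipedia-linking steps.
mentions = [
    ("Speaker A", "Party X", "Amsterdam", "LOC"),
    ("Speaker A", "Party X", "Speaker B", "PER"),
    ("Speaker B", "Party Y", "Amsterdam", "LOC"),
    ("Speaker B", "Party Y", "NWO", "ORG"),
    ("Speaker A", "Party X", "Amsterdam", "LOC"),
    ("Speaker C", "Party X", "NWO", "ORG"),
]


def top_speakers_for_location(records, location):
    """Rank speakers by how often they mention the given location."""
    counts = Counter(
        speaker
        for speaker, _party, entity, etype in records
        if etype == "LOC" and entity == location
    )
    return counts.most_common()


def organisations_by_party(records):
    """Count organisation mentions, keyed by (party, organisation)."""
    return Counter(
        (party, entity)
        for _speaker, party, entity, etype in records
        if etype == "ORG"
    )


print(top_speakers_for_location(mentions, "Amsterdam"))
print(organisations_by_party(mentions))
```

The same pattern covers "who speaks about whom" by filtering on person entities instead of locations.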
@inproceedings{
title = {Two-stage named-entity recognition using averaged perceptrons},
author = {L. Buitinck and M. Marx},
booktitle = {Proc. 17th International Conference on
Applications of Natural Language Processing
to Information Systems},
editor = {G. Bouma and A. Ittoo and E. M\'{e}tais
and H. Wortmann},
publisher = {Springer},
address = {Groningen, Netherlands},
year = 2012
}
The arrival of enormous digital data collections consisting of (mostly scanned) texts has created great demand among humanities scholars for help in opening up that data.
In the projects that ILPS carries out with humanities scholars, the following two requests keep coming up:
an extensive “advanced search” facility, as good as Google, but on my own specific collection, with specific extra search options;
data analysis on large amounts of text, in order to test hypotheses quantitatively.
Handelingen der Staten Generaal
Within the PoliticalMashup project we collaborate with humanities scholars from DNPP, NIOD, ING-Huygens, Meertens, INL, ASCoR, and various universities and civil-society institutions.
They are very interested in a magnificent data set: the complete Handelingen der Staten Generaal from 1814 to the present day, which are digitally available at the KB.
We demonstrate the power of information extraction combined with structured search technology in XML using two examples:
the Three of Breda (de drie van Breda)
NWO
In this demonstration we restrict ourselves to advanced search.
We collect evidence to answer the following question: Is the quality of the XML documents found on the web sufficient to apply XML technology like XQuery, XPath and XSLT? XML collections from the web have been previously studied statistically, but no detailed information about the quality of the XML documents on the web is available to date. We address this shortcoming in this study. We gathered 180K XML documents from the web. Their quality is surprisingly good; 85.4% are well-formed and 99.5% of all specified encodings are correct. Validity needs serious attention. Only 25% of all files contain a reference to a DTD or XSD, of which just one third is actually valid. Errors are studied in detail. Automatic error repair seems promising. Our study is well documented and easily repeatable. This paves the way for a periodic quality assessment of the XML web.
The full paper and all data are publicly available at the url http://data.politicalmashup.nl/xmlweb.
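To make the notion of a well-formedness rate concrete, the following Python sketch checks a handful of documents with the standard library parser. It is only an illustration under simple assumptions: the sample strings and function names are invented here, this is not the tooling used in the study, and it does not check validity against a DTD or XSD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def well_formed_ratio(documents):
    """Fraction of documents that are well-formed (0.0 for an empty list)."""
    if not documents:
        return 0.0
    return sum(1 for doc in documents if is_well_formed(doc)) / len(documents)


# Invented sample documents: two well-formed, two broken.
samples = [
    "<a><b/></a>",   # well-formed
    "<a><b></a>",    # mismatched tag
    "<a>text</a>",   # well-formed
    "<a>",           # unclosed root element
]

print(well_formed_ratio(samples))
```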
We introduce two metrics aimed at evaluating systems that select facetvalues for a faceted search interface. Facetvalues are the values of meta-data fields in semi-structured data and are commonly used to refine queries. It is often the case that there are more facetvalues than can be displayed to a user and thus a selection has to be made. Our metrics evaluate these selections based on binary relevance assessments for the documents in a collection. Both our metrics are based on Normalized Discounted Cumulated Gain, a commonly used Information Retrieval metric.
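For readers unfamiliar with the underlying measure, here is a minimal Python sketch of Normalized Discounted Cumulated Gain over a ranked list of binary relevance judgements. It uses one common formulation with a log2(rank + 1) discount, which is an assumption on our part; it is not the two facetvalue metrics introduced in the paper.

```python
import math


def dcg(gains):
    """Discounted cumulated gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Binary relevance of the items at ranks 1..4 (1 = relevant, 0 = not relevant).
print(ndcg([1, 0, 1, 0]))
```

A ranking that places all relevant items first scores 1.0; pushing relevant items down the list lowers the score.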
@inproceedings{schuth_evaluation_2011 ,
title = {Evaluation Methods for Rankings of Facetvalues for Faceted Search},
booktitle = {Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation 2011},
year = {2011},
publisher = {Springer},
author = {Schuth, A. and Marx, M.J.}
}
May 24 at 16.00, Maarten Marx will give a talk at the Informatics Institute colloquium.
Location: Science Park 904, Room D1.113, Amsterdam
Title: Parliamentary Information Systems
Abstract:
The proceedings of national parliaments are fascinating material for information scientists.
For the Netherlands, they consist of 197 years of digitally available data. Apart from some modern gaps (see http://politicalmashup.nl/2011/03/uva-informatica-onderzoek-leidt-tot-kamervragen/), this dataset is complete. We have similar complete datasets for the UK, Spain and the Flemish parliament (though for shorter periods).
Anachronistically we can describe the data as a multimedia, hyperlinked database consisting mostly of rich semi-structured text documents.
Within the PoliticalMashup project, UvA turns this anachronism into reality. This opens a wealth of new research possibilities situated in the emerging field of computational humanities.
In the talk we will show both the techniques used for the transformation and their applications within the computational humanities.
Justin van Wees looked at the existing communities within the Informatics Institute by analyzing k-clique communities within the IvI co-author graph. In the attached diagrams we show the largest 3-clique community (66 nodes) and the two largest 4-clique communities contained in it (24 and 18 nodes). Next we show the second and third largest 3-clique communities (7 and 4 nodes). There is one more 4-clique community, but it was contained in the large component, so we did not show it: ivi_network_3_en_4.pdf and ivi_network_3_en_4.svg (the small 4-node component drifted out of the picture here).
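For readers who want to see what a k-clique community is, the following Python sketch implements the idea by clique percolation: two k-cliques are adjacent when they share k-1 nodes, and a community is the union of all k-cliques reachable from one another. The edge list is a made-up toy co-author graph, and the brute-force clique enumeration is an assumption chosen for brevity; it is not necessarily how the analysis above was carried out and it only scales to small graphs.

```python
from itertools import combinations


def k_clique_communities(edges, k):
    """Return k-clique communities (sets of nodes), largest first."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    # Enumerate all k-cliques by brute force (small graphs only).
    cliques = [
        frozenset(c)
        for c in combinations(sorted(neighbours), k)
        if all(b in neighbours[a] for a, b in combinations(c, 2))
    ]

    # Union-find over cliques: merge cliques that share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # Each community is the union of the nodes of its merged cliques.
    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(communities.values(), key=len, reverse=True)


# Toy co-author graph: two triangles sharing an edge, plus a separate triangle.
toy_edges = [
    ("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"),
    ("d", "e"), ("e", "f"), ("f", "g"), ("e", "g"),
]

print(k_clique_communities(toy_edges, 3))
```

In the toy graph the edge between d and e belongs to no triangle, so it does not merge the two 3-clique communities.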
The University of Amsterdam has a fully funded 4-year PhD position available. The research topic is the interplay of logic, finite model theory, and the theory of (XML) trees, and is motivated by a concrete problem in database research: