Quality of the XML web

Posted on 04-08-2011 by Maarten Marx | XML, research, results | No comments

A paper on the quality of the XML files found on the web will be published in the proceedings of the 2011 ACM Conference on Information and Knowledge Management (CIKM).

Abstract

We collect evidence to answer the following question: Is the quality of the XML documents found on the web sufficient to apply XML technology like XQuery, XPath and XSLT? XML collections from the web have previously been studied statistically, but no detailed information about the quality of the XML documents on the web is available to date. We address this shortcoming in this study. We gathered 180K XML documents from the web. Their quality is surprisingly good: 85.4% are well-formed and 99.5% of all specified encodings are correct. Validity needs serious attention. Only 25% of all files contain a reference to a DTD or XSD, and just one third of those are actually valid. Errors are studied in detail. Automatic error repair seems promising. Our study is well documented and easily repeatable. This paves the way for a periodic quality assessment of the XML web.
The full paper and all data are publicly available at the URL http://data.politicalmashup.nl/xmlweb.
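
To make the well-formedness figure concrete, here is a minimal sketch of how such a percentage could be computed. It is not the pipeline used in the study: the function names and the sample documents are hypothetical, and the check relies on Python's standard non-validating parser, so it says nothing about validity against a DTD or XSD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes: bytes) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


def well_formed_fraction(documents) -> float:
    """Fraction of documents that parse without error (0.0 for an empty list)."""
    if not documents:
        return 0.0
    return sum(is_well_formed(d) for d in documents) / len(documents)


# Hypothetical sample, not data from the study.
SAMPLE = [
    b"<a><b>ok</b></a>",                               # well-formed
    b"<a><b>unclosed</a>",                             # mismatched tags
    b'<?xml version="1.0" encoding="UTF-8"?><root/>',  # well-formed
    b"<a>stray & ampersand</a>",                       # unescaped ampersand
]

print(f"{well_formed_fraction(SAMPLE):.1%} well-formed")
```

Passing raw bytes rather than decoded strings lets the parser honour the encoding named in the XML declaration, which matters when the declared encoding is itself one of the things being checked.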

Evaluation Methods for Rankings of Facetvalues for Faceted Search

Posted on 19-07-2011 by Anne | research | No comments

A paper on Evaluation Methods for Rankings of Facetvalues for Faceted Search was accepted at the Conference on Multilingual and Multimodal Information Access Evaluation 2011. Below is the abstract:

We introduce two metrics aimed at evaluating systems that select facetvalues for a faceted search interface. Facetvalues are the values of meta-data fields in semi-structured data and are commonly used to refine queries. There are often more facetvalues than can be displayed to a user, so a selection has to be made. Our metrics evaluate these selections based on binary relevance assessments for the documents in a collection. Both our metrics are based on Normalized Discounted Cumulated Gain, a widely used Information Retrieval metric.
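
For readers unfamiliar with the underlying measure, the sketch below computes plain Normalized Discounted Cumulated Gain over a ranked list with binary relevance labels. It uses one common formulation (a log2 discount on the rank) and is not either of the two facetvalue metrics from the paper; the function names are our own.

```python
import math


def dcg(relevances):
    """Discounted cumulated gain: the gain at 1-based rank i is divided by log2(i + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal ordering (most relevant items first)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Binary relevance of the documents at ranks 1..5 (hypothetical judgments).
ranking = [1, 0, 1, 0, 0]
print(round(ndcg(ranking), 3))
```

A ranking that places all relevant documents first scores 1.0, and every relevant document that appears lower in the list contributes less because of the logarithmic discount.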

A PDF version of the paper can be found here. A longer version with experiments is also available.

@inproceedings{schuth_evaluation_2011,
title = {Evaluation Methods for Rankings of Facetvalues for Faceted Search},
booktitle = {Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation 2011},
year = {2011},
publisher = {Springer},
author = {Schuth, A. and Marx, M.J.}
}

Informatics Institute colloquium

Posted on 12-05-2011 by Maarten Marx | lecture, parliament, research | No comments

May 24 at 16.00, Maarten Marx will give a talk at the Informatics Institute colloquium.

Location: Science Park 904, Room D1.113, Amsterdam
Title: Parliamentary Information Systems
Abstract:
The proceedings of national parliaments are fascinating material for information scientists.
For the Netherlands, they consist of 197 years of digitally available data. Apart from some modern gaps (see http://politicalmashup.nl/2011/03/uva-informatica-onderzoek-leidt-tot-kamervragen/), this dataset is complete. We have similarly complete datasets for the UK, Spain and the Flemish parliament (though for shorter periods).

Anachronistically we can describe the data as a multimedia, hyperlinked database consisting mostly of rich semi-structured text documents.
Within the PoliticalMashup project, UvA turns this anachronism into reality. This opens a wealth of new research possibilities situated in the emerging field of computational humanities.

In the talk we will show both the techniques used for the transformation and applications within the computational humanities.

Justin van Wees looked at the existing communities within the Informatics Institute by analyzing k-clique communities within the IvI co-author graph. In the attached diagrams we show the largest 3-clique community (66 nodes) and the two largest 4-clique communities contained in it (24 and 18 nodes). Next we show the second and third largest 3-clique communities (7 and 4 nodes). There is one more 4-clique community, but it was contained in the large component, so we did not show it:
ivi_network_3_en_4.pdf and ivi_network_3_en_4.svg (the small 4 node component drifted out of the picture here)
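
As an illustration of what a k-clique community is, the sketch below implements clique percolation by brute force: it enumerates all k-cliques and merges those that share k - 1 nodes. This is not the code used for the IvI analysis; the function name and the tiny co-author graph are hypothetical, and the enumeration is only practical for small graphs.

```python
from itertools import combinations


def k_clique_communities(edges, k):
    """Clique percolation: union all k-cliques that share k - 1 nodes."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Brute-force enumeration of every k-clique (small graphs only).
    cliques = [
        frozenset(c)
        for c in combinations(sorted(adj), k)
        if all(b in adj[a] for a, b in combinations(c, 2))
    ]

    # Union-find over clique indices.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # A community is the union of the nodes of all cliques in one group.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical co-author graph: two triangles sharing an edge, a bridge, one more triangle.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("e", "g"), ("f", "g")]

for community in k_clique_communities(EDGES, 3):
    print(sorted(community))
```

In this toy graph the triangles a-b-c and b-c-d share the edge b-c and therefore merge into one community, while the bridge d-e belongs to no triangle and so does not connect it to e-f-g.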

Protected: Elsevier and the Tweede Kamer

Posted on 08-04-2011 by Maarten Marx | Political Mashup, research

This post is password protected.


PhD position in Logic/XML/Trees

Posted on 07-04-2011 by Maarten Marx | XML, research | 1 comment

The University of Amsterdam has a fully funded 4-year PhD position available. The research topic is the interplay of logic, finite model theory, and the theory of (XML) trees, and is motivated by a concrete problem in database research:

Data Exchange for Document Centric XML.

| read more…

XML Prague 2011

Posted on 01-04-2011 by Anne | XML, eXist, research, xquery | No comments

PoliticalMashup was represented in Prague at the XML Prague conference. The day before, at the pre-conference, Anne presented his work on Fast Faceted Search in XML.

The captured livestream of that presentation is shown below. The slides are here.

Protected: if P then P challenge

Posted on 27-01-2011 by Maarten Marx | research

This post is password protected.


Protected: Influence of new parties in parliament

Posted on 19-01-2011 by Maarten Marx | research

This post is password protected.


Collection of maiden speeches

Posted on 13-10-2010 by Maarten Marx | Uncategorized, education, research, trivia | 1 comment

Within the PoliticalMashup project we have compiled a collection of maiden speeches from the Eerste Kamer and Tweede Kamer (the Dutch Senate and House of Representatives). According to Wikipedia:

A maiden speech is the first speech given by a newly-elected member of a legislature or parliament.

In the proceedings from January 1995 through the summer of 2010 we found 280 maiden speeches.

The image below contains a special word cloud of all maiden speeches of one party. The words in the cloud were chosen because they strongly express what is distinctive about this party's maiden speeches compared with those of other parties. The cloud was made by Rianne Kaptein and is based on her work with Jaap Kamps and Djoerd Hiemstra on word clouds.

Any idea which party the speakers of these maiden speeches belong to?

| read more…

Protected: Comparing XML schema languages

Posted on 17-06-2010 by Maarten Marx | XML, research

This post is password protected.