Tuesday 23 November 2010

A Wikipedia that AI systems can safely learn from

Below is a letter I wrote in reply to a fundraising email from Jimmy Wales for Wikipedia. I heartily support Wikipedia’s wonderful effect, and was involved in Nupedia, from whose ashes Wikipedia rose phoenix-like. However, I think an inadvertent but serious error was made at Wikipedia’s founding – the adoption of the GFDL instead of a truly free, CC0-style licence. I believe the Share-Alike (SA) requirement unnecessarily interferes with the freedom of Wikipedia users to use the content, and that this is regrettable. For human users, the effect is mitigated by the difficulty of establishing which of their subsequent intellectual work is a derivative work of Wikipedia. For AI systems, though, the SA requirement implies a level of invasion of data privacy (the systems’ own, but also that of the people they interact with) that is wholly unconscionable. The letter suggests a way to remedy this gradually, by building up a truly free portion of Wikipedia. I hope it is adopted.


I think this appeal will be effective. However, its effectiveness for me is reduced by the fact that it's not entirely true that "you can use the information in Wikipedia any way you want". You cannot combine it with other information without infecting the combination with a "Share Alike" obligation that you must then impose on others.

If you were able to persuade the Foundation to give creators of new articles the choice of creating them under a pure CC licence, with no SA, and if it were permissible to create parallel articles, without reuse of SA content, under the truly FREE CC0 licence, then Wikipedia would be truly free, as in freedom. And, if that happens, I will make Wikipedia my main object of charity, and will encourage others to do so.

If not, perhaps you could modify the language in the appeal to be more legally accurate. However, even as flawed and unfree as it is, Wikipedia remains at present a wonderful thing, and I will probably continue to donate, a little reluctantly.

Friday 19 March 2010

Semantic Data

Part of what we’ve been trying to do with the LarKC project is to scale up AI to tackle real problems. One part of that is supporting the storage of vast amounts of inferentially productive knowledge. The SemData initiative is trying to do just that.




Workshop on Semantic Data Management (SemData)

At the 36th International Conference on Very Large Data Bases

Singapore: 13 - 17 Sept 2010, Grand Copthorne Waterfront Hotel

The Semantic Web represents the next generation Web of Data, where information
is published and interlinked in order to facilitate the exploitation of its
structure and meaning for both humans and machines. Semantic Web applications
require database management systems for the handling of structured data, taking
into consideration the models used to represent semantics. To foster the
realization of the Semantic Web, the World Wide Web Consortium (W3C) developed
a set of metadata models, ontology models, and query languages. Today, most
Semantic Web repositories are database engines that store data represented in
RDF, support SPARQL queries, and can interpret schemata and ontologies
represented in RDFS and OWL. We are thus at the point where the adoption of
semantic technologies is growing. However, these technologies often appear
immature, and tend to be too expensive or risky to deploy in real business
settings. Solid data management layer concepts, architectures, and tools are
important to everyone in the semantic ecosystem, and creating them requires a
strong community, with a critical mass of involvement.
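The triple model at the heart of these repositories is simple enough to sketch in a few lines. Below is a toy illustration, not a real RDF engine: triples are plain (subject, predicate, object) tuples, and a SPARQL-style basic graph pattern reduces to tuple matching with variables. All resource names here are invented for the example.

```python
# Toy triple store: RDF data reduces to (subject, predicate, object) tuples.
# A SPARQL basic graph pattern is then just tuple matching with variables.
triples = {
    ("ex:LarKC", "ex:type", "ex:Project"),
    ("ex:LarKC", "ex:topic", "ex:Reasoning"),
    ("ex:SemData", "ex:type", "ex:Workshop"),
}

def match(pattern, store):
    """Return variable bindings for one triple pattern; '?x' marks a variable."""
    results = []
    for triple in store:
        binding = {}
        for part, term in zip(pattern, triple):
            if part.startswith("?"):
                binding[part] = term
            elif part != term:
                break
        else:
            results.append(binding)
    return results

# Analogue of: SELECT ?s WHERE { ?s ex:type ex:Project }
print(match(("?s", "ex:type", "ex:Project"), triples))
```

A real repository does the same matching over indexed, disk-resident data and joins many such patterns, but the data model is exactly this.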

Semantic data management refers to a range of techniques for the manipulation
and usage of data based on its meaning. It enables sustainable solutions for a
range of IT environments, where the usage of today's mainstream technology is
either inefficient or entirely unfeasible: enterprise data integration, life
science research, data sharing in SaaS architectures, querying linked data on
the Web. In a nutshell, semantic data management fosters the economy of
knowledge, facilitating more comprehensive usage of larger scale and more
complex datasets at lower cost.
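"Usage of data based on its meaning" can be made concrete with a small example of RDFS-style reasoning: from subclass axioms and a type assertion, a repository can derive facts that were never stated explicitly. The following forward-chaining sketch is illustrative only (the vocabulary terms are made up, and real reasoners implement many more entailment rules):

```python
# Minimal RDFS-style inference: derive rdf:type facts via rdfs:subClassOf.
facts = {
    ("ex:Enzyme", "rdfs:subClassOf", "ex:Protein"),
    ("ex:Protein", "rdfs:subClassOf", "ex:Molecule"),
    ("ex:trypsin", "rdf:type", "ex:Enzyme"),
}

def closure(facts):
    """Forward-chain subclass transitivity and type propagation to a fixpoint."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in facts:
            if p != "rdfs:subClassOf":
                continue
            for s2, p2, o2 in facts:
                # (A subClassOf B), (B subClassOf C) => (A subClassOf C)
                if p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, "rdfs:subClassOf", o2))
                # (x type A), (A subClassOf B) => (x type B)
                if p2 == "rdf:type" and o2 == s:
                    new.add((s2, "rdf:type", o))
        if not new <= facts:
            facts |= new
            changed = True
    return facts

inferred = closure(facts)
# ("ex:trypsin", "rdf:type", "ex:Molecule") is now derivable, though never asserted.
```

This is the kind of inferentially productive behaviour that distinguishes a semantic repository from a plain database: queries see the entailed graph, not just the asserted one.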

The goal of the SemData workshop is to provide a platform for the discussion
and investigation of various aspects of semantic databases and data management
in the large. Many semantic data management challenges culminate in the need
for scalable, high-performance database solutions for semantic data, a building
block that still lags well behind comparable non-semantic technologies. For
semantic technologies to capture their targeted market share, it is
indispensable that technological progress allow semantic repositories to reach
near performance parity with the best RDBMS solutions, without sacrificing the
advantages of higher query expressivity compared to basic key-value stores, or
higher schema flexibility compared to the relational model. Users should no
longer have to pay a heavy price, in longer run times or more expensive
equipment, for profiting from the flexibility of the generic physical model
underlying the graph-based structures of RDF. We also recognize that greater
flexibility will always carry some burden. Hence, the goal is to minimize the
drawbacks and maximize the advantages of the RDF-based approach.

The SemData workshop seeks trans-disciplinary expert discussions on issues such
as semantic repositories, their virtualization and distribution, and
interoperability with related database solutions such as relational, XML, graph
databases or others. We thus welcome original academic and industry papers or
project descriptions that propose innovative approaches to semantic data
management in the large, with a particular focus on semantic database
solutions, including their virtualization and distribution.

The topics of interest of this workshop include but are not limited to:
* semantic repositories and databases: storage facilities for semantic artifacts,
RDF repositories, reasoning supported data management infrastructures, data
base schemas optimized for semantic data, indexing structures, storage density
and performance improvements
* distribution, interoperability, and benchmarking: "Classical" semantic storage
subjects: distributed repositories (data partitioning, replication, and
federation); interoperability and integration with RDBMS; performance
evaluation and benchmarking
* virtualized semantic repositories: identification and composition of (fragments
of) datasets in a manner that abstracts applications from the specific setup
of the data management service (e.g. local vs. remote, distribution)
* semantic data bus: a communication layer bridging the gap between the data
layer and the application layer
* embedded data processing: "move the processing close to the data" mechanisms,
allowing application-specific data processing to be performed within the
semantic repository, e.g. stored procedures and engine extension APIs
* adaptive indexing and multi-modal retrieval: strategies for dynamic
materialization towards specific data- and query-patterns; indexing structures
for specific types of data and queries (FTS, co-occurrence, concordance,
temporal, spatial)
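The "indexing structures" topic above can be illustrated with a classic triple-store trick: keep several permutations of the same triples (SPO, POS, OSP) so that any access pattern is answered by a direct index lookup. A hypothetical sketch with plain dictionaries (real engines use B-trees or sorted on-disk runs, and all names here are invented):

```python
from collections import defaultdict

# Triple stores commonly maintain multiple orderings (SPO, POS, OSP) of the
# same data, so each query pattern maps to a direct lookup in some index.
class TripleIndex:
    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))  # subject -> predicate -> objects
        self.pos = defaultdict(lambda: defaultdict(set))  # predicate -> object -> subjects
        self.osp = defaultdict(lambda: defaultdict(set))  # object -> subject -> predicates

    def add(self, s, p, o):
        # Every insert updates all three permutations.
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def objects(self, s, p):
        """Answer pattern (s, p, ?o) with one lookup in the SPO index."""
        return self.spo[s][p]

    def subjects(self, p, o):
        """Answer pattern (?s, p, o) with one lookup in the POS index."""
        return self.pos[p][o]

idx = TripleIndex()
idx.add("ex:paper1", "ex:submittedTo", "ex:SemData")
idx.add("ex:paper2", "ex:submittedTo", "ex:SemData")
print(idx.subjects("ex:submittedTo", "ex:SemData"))  # both papers, one lookup
```

The trade-off named in the call is visible even here: the redundant permutations cost storage density, which is why index selection and compression are active research topics.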

Paper Submission Deadline May 21, 2010
Acceptance Notification June 21, 2010
Camera Ready July 11, 2010
SemData Workshop September 17, 2010

The papers must be submitted in the VLDB format; please see
http://vldb2010.org/ppp.htm. Submissions that do not comply with the formatting
detailed for VLDB will be rejected without review. The paper length is limited
to 6 pages.

Karl Aberer
Distributed Information Systems Laboratory LSIR
École Polytechnique Fédérale de Lausanne, Switzerland

Reto Krummenacher
Semantic Technology Institute STI
University of Innsbruck, Austria

Atanas Kiryakov
Ontotext AD, Sofia, Bulgaria

Rajaraman Kanagasabai
Data Mining Department
Institute for Infocomm Research, Singapore

Web: http://semdata.org/events/2010/vldb/
Email: reto.krummenacher@sti2.at
Phone: +43 (0)512 507 6452
Fax: +43 (0)512 507 94906452
