Contact:
borsmose@ruc.dk
PhD Course: Concept Analysis and Concept
Based Retrieval
Participants - Project descriptions
- Alessio Bosca, Politecnico di Torino, Italy.
- Antoine Doucet, University of Helsinki, Finland.
- Bonino Dario, Politecnico di Torino, Italy.
- Davide Martinenghi, Roskilde University, Denmark.
- Eija Airio, University of Tampere, Finland.
- Ekaterina Mhaanna, Copenhagen Business School, Denmark.
- Frédéric HALLOT, Royal Military Academy, Belgium.
- Gunn Inger Lyse, University of Bergen, Norway.
- Henrik Bulskov, Roskilde University, Denmark.
- Henrik Oxhammar, Stockholm University, Sweden
- Jaak Simm, University of Tartu, Estonia.
- Jenny Eriksson, Uppsala University, Sweden.
- Jesper Mathiassen, Roskilde University, Denmark.
- Jesper Vinther Christensen, Technical University of Denmark.
- Kadri Vider, University of Tartu, Estonia.
- Kean Huat Soon, University of Muenster, Germany.
- Kendall Lister, The University of Melbourne, Australia.
- Lone Bo Sisseck, Copenhagen Business School, Denmark.
- Olatz Ansa, University of the Basque Country
- Päivi Pasanen, University of Helsinki, Finland.
- Paulo Gottgtroy, Auckland University of Technology, New Zealand.
- Pirkko Saatsi, University of Tampere, Finland.
- Puay Leng Lee, Strathclyde University, UK
- Rasmus Knappe, Roskilde University, Denmark.
- Sotiris Rompas, University Of Glasgow, Scotland, UK.
- Thomas Terney, Roskilde University, Denmark.
- Till C. Lech, CognIT as / University of Bergen, Norway.
Alessio Bosca, Politecnico di Torino, Italy.
Title: Analysis, specification and development of systems and architectures
for distributed processing
In recent years, research in software engineering and intelligent systems
has produced new algorithms and tools for building complex software
systems. Such systems must deal with data that keeps growing in volume
and structural complexity, and therefore require self-managing
capabilities.
Of particular interest is the study of models, methodologies and tools
for the next-generation web, such as the Semantic Web, ontologies,
multi-agent systems and autonomic computing, in order to build platforms
for distributing services related to innovative Web Intelligence
applications. These technologies are closely linked: the core value
added by the Semantic Web lies precisely in its machine-understandable
paradigm, so software agents are the future direct consumers of those
resources and the delegates for such high-level information processing.
Furthermore, these technologies find a natural place in sectors with high
socio-economic impact, such as e-Government and e-Health.
Antoine Doucet, University of Helsinki, Finland.
Title: Exact coverage information retrieval based on static passage-clustering
For a few decades, the amount of digital information stored in computer
systems and networks has been growing exponentially. Using automated data
collection tools, massive amounts of data have been collected and stored
in databases. Within this rising quantity of data, the proportion consisting
of textual data has been constantly increasing. Consequently, automatic
and efficient methods have been developed for searching for specific information
within textual data.
Unfortunately, these algorithms do not take any extra information into
account. Information such as the structure of a document would definitely
improve the relevance of document descriptors. Since this data is nowadays
included in most textual documents, an obvious need for including this
contextual information within the algorithms has now appeared.
My first goal is to extend the set of document features, so as to include
hierarchical document information. The hypothesis here is that the hierarchical
structure of a document (e.g., its subdivision into sections, subsections,
paragraphs, subparagraphs, etc.) carries semantically meaningful information.
In current text processing systems, this information is often skipped
or used only partially (for example, by increasing the weight assigned
to a word when it is used in a title element).
Taking this extra data into account raises many difficulties, most of
them computational. For example, in the case of geographical information,
a way must be found to include non-sequential information (such as maps,
images or tables) in a process based on typically sequential data (the words
used in natural language). How to mix these totally different kinds of data
is a real problem; indeed, the occurrence of a map within a document often
carries the same information whether it lies before or after a given word
sequence.
When adding information to the document using a morphological
analyser, the potential amount of new information is huge. A way to select
relevant morphological information has to be devised. Then, a way to
represent this information has to be found. A first option might be to replace
each word by a feature vector including the corresponding data. But handling
several pieces of information for the same word is another big issue, since the
method is so far based on a single piece of information (the word itself) per
index of the sequence.
This basically defines the main challenge of this thesis, i.e., to adapt
ideas based on sequential and single-valued information to semi-sequential
and multi-valued information.
Bonino Dario, Politecnico di Torino, Italy.
The Ph.D research activity involves the usage and the extension of the technologies
proposed by the "Semantic Web" research community, to develop
new architectures and algorithms for intelligent web sites and services.
Those applications are able to provide high-level interaction by integrating
many techniques coming from Artificial Intelligence, in particular "Computational
Intelligence", and from the Evolutionary Algorithms research field.
The developed tools will be applied to different real world problems involving
usability, wireless interfaces, search engines and multimedia information
retrieval.
With respect to the first Ph.D year, research activities involved two main
topics which are related to the Web Intelligence field and the Semantic
Web. The first one includes the design and preliminary implementation
of an evolutionary prediction system that aims at correctly predicting
the next requested page on a web server. Different application scenarios
have been investigated, specifically with regard to the dynamic Web area.
The second branch involved the proposal, design and implementation of a
semantic platform able to provide automatic semantic annotation facilities
and semantic search functionalities for existing web resources. The architecture
is able to work at different levels of detail, for example from a single
paragraph in a web page to entire chapters in a book, and is well integrated
with existing semantic and lexical nets such as WordNet.
Davide Martinenghi, Roskilde University, Denmark.
The purpose of the PhD project is to develop a general model for evaluation
of database integrity constraints with possible emphasis on incremental
maintenance and automatic or semi-automatic generation of update routines.
In particular, a procedure for simplification of integrity constraints
can be defined in order to specialize integrity checking for the (classes
of) updates that are given to the database. Furthermore, a simplification
can be characterized as optimal if it corresponds to a minimum in an ordering
that represents the effort of evaluating the specialized constraints.
It can be shown that no optimal simplification procedure can exist, due
to unavoidable undecidability issues.
The best we can do is to find good approximations of optimal procedures
and applications in which they prove useful. Simplification of integrity
constraints comes in handy, e.g., for data integration, semantic query
optimization, abductive reasoning and data mining; but integrity constraints
can be used to express semantic knowledge in a number of different contexts,
such as text processing and concept analysis.
Eija Airio, University of Tampere, Finland.
The subject of my dissertation is cross-language information retrieval
(CLIR), which is a subset of information retrieval (IR). In CLIR, the request
is represented in a language which differs from the document language(s).
There are two kinds of CLIR tasks: bilingual and multilingual tasks. The
former have only one target language, while the latter have
multiple target languages. My research deals with both bilingual
and multilingual CLIR tasks.
There are two main approaches in CLIR: either to translate the queries
into the target languages or to translate the documents into the source
language. The first approach is easier and cheaper to implement, and I
apply it in my research as well. There are many ways to translate the
queries. I apply bilingual dictionaries. The dictionary-based approach
is simple: all the words (except stop-words) are translated into the
target language(s), and retrieval is performed with the translated query.
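The dictionary-based translation step described above can be sketched as follows. This is only an illustrative sketch: the toy English-to-Finnish dictionary, the stop-word list, the function name `translate_query`, and the choice to keep untranslatable words unchanged are all assumptions made for the example, not details of the research system.

```python
# Illustrative sketch of dictionary-based query translation for CLIR.
# The stop-word list and the dictionary below are toy, hypothetical data.
STOP_WORDS = {"the", "of", "in", "a", "and"}

BILINGUAL_DICT = {
    "safety": ["turvallisuus"],
    "ship": ["laiva", "alus"],
}


def translate_query(query, dictionary, stop_words):
    """Translate every non-stop-word of the query into the target language."""
    translated = []
    for word in query.lower().split():
        if word in stop_words:
            continue  # stop-words are dropped, not translated
        # A word may have several translations; all of them are kept.
        # Assumption: words missing from the dictionary pass through unchanged.
        translated.extend(dictionary.get(word, [word]))
    return translated


print(translate_query("The safety of a ship", BILINGUAL_DICT, STOP_WORDS))
```

Retrieval would then be run with the translated word list as the query.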
Ontologies are useful in IR in two ways: 1) they allow the user to navigate
in order to find the relevant concepts for retrieval, and, 2) they offer
a tool for query expansion. In CLIR, ontologies may be used in many different
ways. For example, the user could navigate the ontology in the source
language, searching the right concepts and expanding the query, after
which the query is translated by a bilingual dictionary. An alternative,
though more demanding, approach would be to create multilingual ontologies.
I will apply the first of these approaches in my research, and possibly
the latter as well, if resources allow.
Ekaterina Mhaanna, Copenhagen Business School, Denmark.
Frédéric HALLOT, Royal Military Academy of Belgium.
This thesis is at its very beginning, and therefore its final direction
is not yet very clear.
With this research, I want to focus on the possibility of making ontologies
independent of natural languages.
To make this concrete, I want to work on an existing project.
I have already developed an online multilingual examination system.
Currently this system is multilingual in the sense that all the different
linguistic versions of the questions and answers are very nicely stored
in a normalized relational database. The system is quite complete. Indeed
it allows questions with textual answers, but also multiple choice questions.
The examinations are randomly composed in the language of the student.
All the multiple choice answers are automatically corrected by the system.
For the textual answers there are three possibilities: if the student did not
give a response, the system considers the answer incorrect; if the student
gives literally the same answer as the one stored in the system, the
system marks the answer as correct; otherwise the teacher must compare
both answers and decide on the correctness of the answer.
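The three-way decision for textual answers can be summarised in a few lines of code. This is a minimal sketch; the function name `grade_textual_answer` and the result labels are hypothetical and do not come from the actual examination system.

```python
# Hypothetical sketch of the three-way decision for textual answers.
def grade_textual_answer(student_answer, stored_answer):
    """Return 'incorrect', 'correct' or 'manual_review'."""
    if not student_answer.strip():
        return "incorrect"      # no response was given
    if student_answer == stored_answer:
        return "correct"        # literally the same answer as the stored one
    return "manual_review"      # the teacher must compare both answers


print(grade_textual_answer("", "Paris"))
print(grade_textual_answer("Paris", "Paris"))
print(grade_textual_answer("The capital is Paris", "Paris"))
```

The last case is where ontology matching could reduce the teacher's workload.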
Possible improvements using multilingual ontologies:
• Conversion of one typical answer for a textual question into an
ontology.
• Conversion of the student’s answer into an ontology, and then
trying to match the ontologies, taking synonyms into account.
• Ideally, the student’s answer could be in any natural language
(supported by the system) while the typical answer could be stated in
the teacher’s language, and still the matching of ontologies should
be possible, because then the multilingual aspect of ontologies (and thus
also web services) would be achieved.
Gunn Inger Lyse, University of Bergen, Norway.
Title: Translationally derived information about lexical semantics
for WSD purposes
In recent years, there has been an increasing interest in the use of
translational data as a source of information about lexical semantics.
The Mirrors method (Dyvik, 2003) utilises translational (corpus) data in
order to derive the sense-distinctions of words as well as the semantic
relations between the resulting senses (such as synonymy and hypo-/hypernymy).
It is not easy to evaluate the Mirrors method, as it is virtually impossible
to produce an intersubjective «gold-standard» against which
it may be compared.
The goal of this project is to use the practical task of Word Sense Disambiguation
(WSD) in order to evaluate how the Mirrors method lends itself as a source
of paradigmatic lexical information within a practical NLP-task. Concretely,
the project addresses the sparse data problem associated with corpus-based
machine-learning approaches to WSD: How to produce sufficiently informative
sense-tagged data as training material for stochastic machine-learning
(ML) algorithms?
Corpus-based WSD-approaches require large amounts of data, since they
are generally limited to observations on concrete co-occurrence patterns
between a word sense and its context words. Intuitively, however, the
presence of a word a (say, fish) in the context of the target sense in
question (e.g., course in its food sense) also implies that the target
sense is likely to co-occur with a's semantic relatives (e.g., salmon
as a hyponym to fish), even if such semantic relatives are not actually
exemplified in the training corpus. Therefore, an interesting approach
would be to generalise context information from unanalysed words to classes
of semantically related word senses. In the Mirrors wordnet, semantic
relatedness is encoded through (translationally derived) semantic features:
The more closely related two senses are, the more features they share. The basic
idea of the project is to base machine learning on the semantic features
present in the target sense context, rather than on the set of unanalysed
context words, i.e., producing co-occurrence patterns not between words
but classes of words: All words that share some instantiated feature.
Henrik Bulskov, Roskilde University, Denmark.
The topic of my PhD is information retrieval and information extraction.
The main issue is how to refine query evaluation by use of ontological knowledge,
and how to extract information from a text corpus to support this. Instead
of the traditional word-based descriptions of text objects, I want to create
semantic descriptions by shallow natural language parsing and semantic extraction.
This is to be done by recognizing simple noun phrases and subsequently
extracting relations between these noun phrases, to form simple and compound
concepts. The semantic descriptions, i.e. sets of concepts extracted from the
text objects, are used as the basis for the query evaluation, and the same
technique is to be applied to the query definitions, thus giving the same
type of description for text objects in the information base and for query definitions.
The main goal for the query evaluation is to find a representation for the
semantic descriptions and a similarity measure that takes the semantic knowledge
into account. The idea here is to use a simple representation, which can
be mapped directly into an ontology, and then use distance in the ontology
to calculate the similarity and/or relatedness between concepts. An overall
requirement is that both the extraction and the query evaluation are scalable
to large information bases, and that the knowledge, primarily ontologies and
linguistic information, is not bound to one specific knowledge base but
can instead draw on different knowledge sources, for instance found on the Internet,
by translation into a simple generic format.
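As a rough illustration of using distance in an ontology to compute similarity, the sketch below measures the shortest path between two concepts in a small graph. The toy ontology, the function names, and the formula 1 / (1 + distance) are assumptions chosen for the example; the project itself does not specify a particular measure here.

```python
from collections import deque

# Toy, hypothetical ontology given as "is-a" edges (parent, child).
EDGES = [
    ("anything", "animal"),
    ("anything", "vitamin"),
    ("animal", "dog"),
    ("animal", "cat"),
    ("dog", "poodle"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (parent, child) pairs."""
    graph = {}
    for parent, child in edges:
        graph.setdefault(parent, set()).add(child)
        graph.setdefault(child, set()).add(parent)
    return graph


def distance(graph, start, goal):
    """Number of edges on the shortest path, found by breadth-first search."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # the two concepts are not connected


def similarity(graph, a, b):
    """Assumed measure: 1 / (1 + distance), so identical concepts score 1.0."""
    dist = distance(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


graph = build_graph(EDGES)
print(similarity(graph, "poodle", "cat"))
print(similarity(graph, "dog", "dog"))
```

Weighted edges or depth-sensitive measures could be substituted without changing the overall structure.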
Henrik Oxhammar, Stockholm University, Sweden.
This project aims at developing, implementing and testing a system
that identifies company names and product names in web pages, and maps the
product names to a standardized classification scheme (the Common Procurement
Vocabulary (CPV)).
The system would limit manual identification, extraction and classification
and improve consistency of classification decisions. If employed in a semi-automatic
setting, the system will learn from the classification decisions and improve
its decisions with the number of learning instances.
The intention is to tackle the recognition and classification of product
names in the following way:
* Product Name Recognition: First, a pattern-based approach will be
used for finding generic product terms (e.g. alkaline batteries) and true
product names (e.g. The Energizer) in English documents using lexicons,
part-of-speech tags, orthography (i.e., capitalization) etc. and special
characters (e.g., parentheses, slash, '&'). Subsequently, the pattern-based
approach will be complemented by a machine learning approach.
* Product Name Classification: Product names will be matched to nodes
in the hierarchy using a vector space approach. In order to map product
names into the hierarchy, the node descriptions will be broadened by building
"clouds" of semantically similar items around them. This will first
be done based on existing thesauri (e.g. WordNet). Eventually the intention
is to use methods from Information Retrieval for automatic construction
of similarity thesauri.
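The vector space matching of product names to hierarchy nodes could look roughly like the sketch below. It uses plain bag-of-words vectors and cosine similarity; the node labels and their "clouds" of related terms are invented for illustration and are not real CPV entries, and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical node descriptions, already broadened with a "cloud" of
# semantically similar terms. These are not real CPV entries.
NODES = {
    "batteries": "batteries battery alkaline cell accumulator",
    "furniture": "furniture chair table desk seat",
}


def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(u[term] * v[term] for term in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(product_name, nodes):
    """Return the node whose description is most similar to the product name."""
    query = vectorize(product_name)
    return max(nodes, key=lambda name: cosine(query, vectorize(nodes[name])))


print(classify("alkaline batteries", NODES))
print(classify("office chair", NODES))
```

A realistic system would add term weighting (for example tf-idf) and a similarity threshold below which no node is assigned.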
Jaak Simm, University of Tartu, Estonia.
The goal of my PhD work is to examine and explore possibilities of
creating architectures capable of developing dynamic ontology for their
internal representation. The idea is to view intelligent systems as a series
of ontological transformations of information. An ontological transformation
transforms information from one ontology to another. To have capabilities
of dynamic ontology is to have the ability to perform dynamic ontological
transformations (instead of the static transformations that are available
to the intelligent system at its creation).
For that reason I plan to analyze the way knowledge representation techniques
transform ontology. Thus, I try to create a framework describing how ontological
commitments are made and to evaluate the efficiency of these commitments. Such a
framework can provide insights for developing architectures with dynamic
ontology capabilities.
Jenny Eriksson, Uppsala University, Sweden.
Jesper Mathiassen, Roskilde University, Denmark.
The area of my project is Computational Linguistics. The focus will be on
partial parsing, and the idea is to describe a syntax analysis component of a
conceptual analysis of domain-specific text corpora. The syntax analyzer
should thus form part of an information retrieval system in the sense
outlined in the project description of the OntoQuery project
(http://www.ontoquery.dk/description/). This system facilitates information
extraction and query answering against the background of a taxonomically
ordered ontology.
The automatic analysis of text within this project aims at generating
descriptors, and the semantics of phrases such as "symptoms due to lack of
vitamin-B" is represented with descriptors of the kind exemplified below:
symptom[CBY : lack[WRT : vitaminB]]
This is a feature-structure representation in which the sort labels are formed
by concepts, while WRT (with respect to) and CBY (caused by) are relation
names that relate concepts in the ontology. It shows the general format that
the analysis is targeting.
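To make the descriptor format concrete, here is a minimal sketch of how such a nested structure could be held in memory and printed back in the bracket notation. The class name `Descriptor` and its fields are assumptions made for illustration; they are not taken from the OntoQuery implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """A concept plus named relations to other descriptors (hypothetical type)."""
    concept: str
    relations: dict = field(default_factory=dict)  # relation name -> Descriptor

    def __str__(self):
        if not self.relations:
            return self.concept
        inner = ", ".join(f"{rel} : {sub}" for rel, sub in self.relations.items())
        return f"{self.concept}[{inner}]"


# "symptoms due to lack of vitamin-B"
descriptor = Descriptor(
    "symptom",
    {"CBY": Descriptor("lack", {"WRT": Descriptor("vitaminB")})},
)
print(descriptor)  # symptom[CBY : lack[WRT : vitaminB]]
```

Because each relation points to another descriptor, compound concepts of any depth can be expressed with the same type.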
One of the key ideas in OntoQuery is that a conceptual grammar is used as a
supplement to the linguistic grammar.
An interesting perspective is to focus on the use of the conceptual grammar
in the parse process, with the purpose of applying rule sets that filter away
parse trees corresponding to phrases that are conceptually marginal or
incorrect in relation to the domain ontology. For example, while in (at least
Danish) political discourse it would be sensible to talk about a 'red
proposal', this makes little sense in other domains.
However, at present, parsers have no way to incorporate such restrictions. In
other words, an adjective and a noun would be recognized as a noun phrase
irrespective of whether the combination is sensible or not. Therefore, such a
technique would be quite helpful in relation to the problem of ambiguity.
Jesper Vinther Christensen, Technical University of Denmark.
Title: Specification of geographic information
Geographic information is an abstract representation of reality; this
is widely accepted. Specification of what is called the universe of discourse
defines the interpretation of reality so as to form a representation that can
be captured in computer-based systems. It is crucial to be able to express
the universe of discourse, i.e. the specifications, in a way that adds conceptual
transparency and clarity, such that these can be used as solid foundations
for both producing and using geographic information.
My PhD concerns formal specification of geographic information with focus
on topographic maps and place names. The aim of the project is to establish
a better understanding of what geographic information is and how it is specified
and represented in computer-based systems. An important aspect of geographic
information is classification of individuals and specification of concept
hierarchies. Defining roles that restrict relations among individuals
is also an important task. Using the concept hierarchies and defined roles,
it is possible to specify rules that must hold for the representation of
individuals. At least this is the project's hypothesis.
The motivations for establishing a framework for writing formal specifications
for geographic information are many. It gives authors a predefined and
structured scheme for writing specifications, which minimizes the risk
of introducing contradictory and unclear rules. Formal specifications
in a well-known structure supply users of geographic information with
access to detailed metadata that can be queried and presented as needed.
Formal specification of geographic information allows validation processes
that can decide whether some information conforms to a specification. Formal
specification makes interchange of specifications between different systems
possible; the implementation of a specification in different software
thereby becomes easier and, more importantly, the risk of implementing a
specification in different ways is minimized.
Kadri Vider, University of Tartu, Estonia.
Title: Word Sense Disambiguation of Verbs According to Lexical-Syntactic
Information
The aim of the dissertation is to study how to disambiguate
verb senses on the basis of corpus material, and what lexical-semantic functions
(according to Fillmore's FrameNet and Mel'cuk's theory of lexical functions)
these verb senses have.
The practical output of my dissertation will be a formally consistent
corpus-based lexical-semantic database that describes the usage of the Estonian
language on the lexical-semantic level, with the main focus
on the senses of verbs.
I have long practical experience in developing the Estonian WordNet (EstWN,
site in the EC project EuroWordNet-2, 1998-1999), and our research group has
considerable experience in word sense disambiguation tasks (participation
in SensEval-2, 2001). An attempt to disambiguate word senses in CELL (Corpus
of Estonian Literary Language) according to the word senses in the Estonian
wordnet pointed to inconsistencies in the splitting of words into senses. This
raises the question of how reliable a wordnet word sense (as just one member
of a synset) is in (con)text, and to what extent word senses created manually
by lexicographers are covered in real usage.
Constructing a lexical-semantic database of verbs from text corpus data
requires taking into consideration that verbs behave differently from
nouns. There is mostly one verb with its argument structure per sentence.
Verbs cannot be clearly divided into senses as nodes in hypernymy/hyponymy
trees; rather, they differ in manner.
Our WSD system works with EstWN hyponym/hypernym hierarchies, taking into
consideration the distances between the nodes corresponding to the word
senses in the EstWN tree as well as the density of the tree. Results for
WSD of verbs are poor for the reasons noted above. My idea is to improve
our WSD system with supporting knowledge about Estonian verb senses
and how their argument structure influences disambiguation.
Kean Huat Soon, University of Muenster, Germany.
My PhD project studies the development of user task ontologies and
aims to ensure that the developed user task ontologies can feasibly be mapped to
the content ontologies that describe the domain knowledge of the database
schemata or text corpus. A document that depicts the user tasks in a particular
domain is selected in this study. The verbs and noun phrases from the
document, which represent the actions and goals of user tasks, are manually
extracted. In order to form the user task ontology, a concrete conceptual
structure is constructed from the extracted information. The conceptual
structure of user tasks is formalized with Formal Concept Analysis (FCA),
a method well suited to the analysis of data. The user task hierarchy
is then represented as a concept lattice derived from a cross tabulation. In
this study, we propose a method in which the verbs of the tasks are represented
as verbal adjectives (verbs with the suffix “-able”). The verbal adjectives
are treated as the formal attributes, whereas the goals of the tasks (noun
phrases) are defined as the formal objects of the concepts. Because the focus
of this study is activity-centered rather than object-centered, the
notion of implication between attributes is used, with the assumption that
the action of one particular user task implies other user tasks to be
accomplished.
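The FCA setup described above (goals as formal objects, verbal adjectives as formal attributes, implications between attributes) can be illustrated on a tiny cross table. Everything in this sketch is an assumption made for illustration: the toy context, the function names, and the brute-force enumeration, which is exponential in the number of objects and only suitable for very small examples.

```python
from itertools import combinations

# Toy, hypothetical formal context (cross table): objects are task goals
# (noun phrases), attributes are verbal adjectives.
CONTEXT = {
    "route": {"plannable", "viewable"},
    "map": {"viewable", "printable"},
    "ticket": {"printable"},
}


def common_attributes(objects, context):
    """Attributes shared by all given objects (all attributes if none given)."""
    shared = set().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return shared


def objects_having(attributes, context):
    """Objects that carry every one of the given attributes."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def formal_concepts(context):
    """Enumerate (extent, intent) pairs by closing every subset of objects."""
    concepts = set()
    names = sorted(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_attributes(subset, context)
            extent = objects_having(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


def implies(premise, conclusion, context):
    """Attribute implication: every object with `premise` also has `conclusion`."""
    return all(conclusion <= attrs for attrs in context.values() if premise <= attrs)


for extent, intent in sorted(formal_concepts(CONTEXT), key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
print(implies({"plannable"}, {"viewable"}, CONTEXT))
```

Ordering the resulting concepts by inclusion of their extents yields the concept lattice, that is, the task hierarchy.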
The user task ontologies developed from the concrete conceptual structure
must be mappable to the domain knowledge. However, this
depends on how reliable and extendable the conceptual structure
resulting from the formalization is. Hence, in this course, I am looking for
answers to the following questions:
• Methods for automating concept-based retrieval, in order to achieve
better retrieval results and to build substantial ontologies from
resources such as corpora, database schemata and documents;
• Constructing a concrete conceptual task hierarchy in which the focus
is on the actions as attributes of the formal concepts;
• The potential of FCA for mapping between two different
ontologies.
Kendall Lister, The University of Melbourne, Australia.
The unprecedented speed with which information is being created and made
available via current information technology has produced a sea of disparate
data sources that do not interoperate (or do so only at a relatively shallow
level, such as data visualisation techniques). Interpreting the meaning
of these many different sources is left to human analysts who must spend
valuable time sorting through them to ‘hand-extract’ the relevant
information. The Structured Knowledge Source Integration (henceforth ‘SKSI’)
project at Cycorp (Austin, Texas) has been developed to address this issue.
Current SKSI tools enable the Cyc knowledge base to integrate (i.e. to access,
to perform complex queries over, to assimilate and to merge) a variety of
external structured knowledge sources, such as databases, spreadsheets,
XML or DAML tagged text, GIS datasets and web pages, with the rich, multi-contextual
Cyc ontology (expressed in the specially-tailored language, ‘CycL’)
acting as the mediating lingua franca.
Such knowledge source integration is achieved declaratively via a Schema
Mapping Language (SML) within CycL. Using this language, assertions are
made in the knowledge base with respect to a given knowledge source regarding:
its ‘physical schema’ (e.g. How many columns does a given
table consist of?), its ‘logical schema’ (e.g. What
information do those columns represent?) as well as its access paths,
privileges and update frequency. A few such simple assertions are all
that is required for the knowledge source to be reasoned over just as
if it were part of the Cyc knowledge base. Sources successfully integrated
can be ‘dynamically generated’. For example, it is now possible
to ask Cyc the current weather in Austin. At present, however, sources
have to be found and ‘hand-declared’ in advance in order to
be used by the Cyc KB. There is a need for functionality that is more
intelligent and proactive in this regard.
We believe that progress in this area may best be made by studying web
pages. Web pages, in many cases, form an interesting middle ground between
the highly structured knowledge sources with which SKSI tools already
deal competently and (currently intractable) natural language (NL) sources,
in that they tend to carry their own semantic declarations – in
NL, but often in very simple phrases (for instance, in the form of tables
or lists of information with simple headers on the columns and sections).
In order to build functionality to proactively identify and add knowledge
sources, we wish to exploit the software agent paradigm, which has emerged
as a potential aid for interacting with knowledge on the Internet. There
is no consensus on an exact definition of the term ‘agent’,
though see (Wooldridge and Jennings, 1995), (Ndumu and Nwana, 1997). However,
one important type of agent is an independent piece of software which
can locate a relatively small amount of accurate information for the end-user,
in part by mimicking how a human, knowledgeable about the domain, would
seek that information (Sterling, 1997, 1999).
This project will build on work already done by the Intelligent Agent
Laboratory at the University of Melbourne in prototyping a range of information
agents over the past 5 years. Significant insight has been gained by the
Intelligent Agent Laboratory as to when a knowledge-based approach to
building software agents may be successful. The key characteristic of
a suitable domain is that there is a variety of pages in differing formats
but there is some common overall structure. Too much structure reduces
the problem to known methods. Too little structure reduces the problem
to natural language understanding which is currently too difficult. Domains
successfully prototyped include finding paper citations, sports scores,
subjects offered in universities, classified advertisements for cars and
real estate, and legal information. It has been possible to develop the
information agents in a way that can be generalised to some extent from
domain to domain (Loke et al, 1999). These agents however have not fitted
within an overall ontology.
The aim of this research, therefore, is to advance knowledge-source integration
technology by exploring ways in which agents can automatically find interesting
web sites on particular topics and automatically generate suitable mappings
for them which integrate them within a large-scale, general ontology (in
this case, the Cyc ontology).
Lone Bo Sisseck, Copenhagen Business School, Denmark.
Olatz Ansa, University of the Basque Country.
In the context of the increasing importance of lexicons in Natural
Language Processing, we have considered the need to build a lexical knowledge
base for Basque. We are interested in developing a general lexical-semantic
framework, in which all types of relations (even multilingual and complex
ones) are incorporated (Agirre et al., 2003).
Recently, two works have been carried out in order to create lexical-semantic
resources for the Basque language:
i) The Basque WordNet resource, which is being developed in the context of the
EuroWordNet project (Agirre et al., 2002), and
ii) The extraction of relations from the analysis of a Basque monolingual
dictionary (Agirre et al., 2000).
The Basque WordNet and the concepts extracted from the monolingual dictionary
are going to be mapped. Our purpose is to enrich the Basque WordNet resource
with the relations stored in the lexical knowledge base, and vice versa.
Indirectly we hope to disambiguate extracted relations from the monolingual
dictionary via this mapping.
Moreover, we want this lexical resource to be usable in practical applications.
The application we are thinking of is a multilingual question-answering
one. This system will receive questions written in Basque, and the answers
will be obtained from multilingual corpora (Basque, Spanish, English).
At present, we are working on the design of this application and discussing
in detail how and where this lexical-semantic information can be used to
improve system results.
Päivi Pasanen, University of Helsinki, Finland.
Title: Terminological data in maritime safety texts and methods
of their extraction
Traditionally, existing standards, dictionaries and glossaries have served
as research material for researchers of terminology. Therefore, there
are no satisfactory language-independent methods for extracting terminological
data from texts. In addition, because a prescriptive approach has been
predominant in terminology, concept development and term variation have
attracted little attention. However, in reality concepts change,
and term variation is common.
The objectives of this work are, firstly, to research terminological data
embedded in texts, and, secondly, to test the applicability and usefulness
of certain methods used for extracting terminological data from texts.
Consequently, proposals to develop new methods will be presented.
The research material consists of specialized texts in the subject domain
of maritime safety. International and national regulations, textbooks,
conference papers, research reports and articles from professional journals
in Finnish and in Russian are included.
The research will be conducted in three phases. The first phase consists
of the application of certain term extraction methods. These are manual
term identification and semi-automatic term extraction, the latter of
which will be carried out by using two commercial computer programmes.
The results of term extraction will be compared and the accuracy and precision
of the methods will be evaluated.
During the second phase, other terminological data such as concept relations
and characteristics will be retrieved from the texts. It has been argued
that certain linguistic expressions could be used to identify terminological
data from texts. In this study these expressions, which some researchers
call knowledge probes, will be applied to identify concept relations and
characteristics.
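A minimal sketch of how knowledge probes might be applied to running text. It assumes English probe phrases for readability (the study itself works with Finnish and Russian texts); the patterns, the relation labels and the function name find_relations are hypothetical illustrations, not the probes used in the project.

```python
import re

# Hypothetical knowledge probes: each pairs a linguistic pattern with the
# type of concept relation it is assumed to signal.
PROBES = [
    (re.compile(r"(?P<a>[\w\s-]+?) is a (?:kind|type) of (?P<b>[\w\s-]+)", re.I), "generic"),
    (re.compile(r"(?P<a>[\w\s-]+?) is (?:a )?part of (?P<b>[\w\s-]+)", re.I), "partitive"),
]


def find_relations(sentence):
    """Return (term, relation, term) triples found by the probes in one sentence."""
    results = []
    for pattern, relation in PROBES:
        for match in pattern.finditer(sentence):
            results.append(
                (match.group("a").strip(), relation, match.group("b").strip())
            )
    return results


if __name__ == "__main__":
    print(find_relations("The rudder is part of the steering gear."))
```

In practice the candidate triples would still need manual checking, since a probe phrase can also occur in sentences that do not express a concept relation.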
The third phase consists of the comparative analysis of concepts in two
languages. The analysis will be performed by the examination of characteristics
identified during the second phase of the study. Concepts will be studied
diachronically and the function of term variation will be considered.
The research will provide new information on the applicability of terminological
methods to the extraction of terminological data from texts. The results
may be applied to terminology work in special fields and to the education
of translators and field professionals.
Paulo Gottgtroy, Auckland University of Technology, New Zealand.
Title: An ontology driven bioinformatics annotation tool
Problem Definition:
Sequence alignments provide a powerful way to compare novel sequences with
previously characterized genes. Statistical measures indicate the
quality of matches. Often statistical and biological significance are
related. Sometimes, however, matches of real biological significance have
low statistical scores.
Objective:
To build an automatic annotation tool that considers both the statistical and
the semantic analysis of the definitions presented in the alignment results.
Proposal Description:
Extracting relevant information from a small corpus, such as the
definition sentences of alignment results, is a challenging task that relies
mainly on information extraction techniques. Most current information
extraction algorithms use a large lexicon and a huge training set to extract
useful information and relationships among entities. However, in the case
of sequence analysis, the results are not homogeneous, the definitions are
short, and sometimes the quality of the annotations is not trustworthy.
To overcome these restrictions we are proposing a methodology that uses
ontologies as knowledge representation. Furthermore we are using different
levels of ontologies, such as domain, application and task ontologies, to
allow reuse and different inference from the same information. The ontologies
are used as a semantic dictionary to guide our concept recognition process.
To implement this solution we are considering the current bioinformatics
annotation processes in order to acquire previous knowledge and represent
the application domain. The resulting annotations are going to be classified
and stored in an ontological representation that is going to be reused in
rule-based inferences that will guide the selection of target data
sources.
The annotation tool is going to be implemented as a knowledge acquisition
tool that will support the population of the knowledge base for further
analysis.
The proposal includes the development and implementation of an integrated
ontology that includes the Gene Ontology as the domain ontology, and other ontologies
built from the current annotation process to support the workflow and the
concepts involved in the application domain. The prototype is going to include
an information extraction process and an automatic ontology annotation.
Pirkko Saatsi, University of Tampere, Finland.
At the moment I work at the University of Tampere in the National Ontology
Project in Finland. I am constructing an ontology to support the information
retrieval process. The subject of the ontology is food products, food
production and food supervision.
The ontology construction is based on a three-level model of information
retrieval. The levels are the conceptual, linguistic and string levels. The
concepts and the relationships among them (generic, partitive and associative)
are represented at the conceptual level. The linguistic level contains
the natural language expression(s) of each concept. The matching patterns
of each expression are given in a query-language-independent way at the
string level.
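The three levels could be represented roughly as follows. This is only a sketch under stated assumptions: the class name Concept, the function concepts_in and the example entries are hypothetical, and regular expressions stand in for the query-language-independent matching patterns, whose actual notation is not given here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Concept:
    # Conceptual level: the concept and its relationships to other concepts.
    name: str
    broader: list = field(default_factory=list)   # generic relationships
    part_of: list = field(default_factory=list)   # partitive relationships
    related: list = field(default_factory=list)   # associative relationships
    # Linguistic level: natural language expressions of the concept, each
    # mapped to its string-level matching patterns (regexes, by assumption).
    expressions: dict = field(default_factory=dict)


def concepts_in(text, concepts):
    """Return the names of the concepts whose string-level patterns match the text."""
    found = []
    for concept in concepts:
        patterns = [p for ps in concept.expressions.values() for p in ps]
        if any(re.search(p, text, re.IGNORECASE) for p in patterns):
            found.append(concept.name)
    return found


if __name__ == "__main__":
    cheese = Concept(
        "cheese",
        broader=["dairy product"],
        expressions={"cheese": [r"\bcheeses?\b"]},
    )
    dairy = Concept(
        "dairy product",
        expressions={
            "dairy product": [r"\bdairy products?\b"],
            "milk product": [r"\bmilk products?\b"],
        },
    )
    print(concepts_in("Imported cheeses are inspected.", [cheese, dairy]))
```

Keeping the patterns at their own level means a query can be stated in terms of concepts and only translated into strings at matching time.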
Puay Leng Lee, Strathclyde University, UK.
My research interest is in the use of conceptual structures
to enable and support Information Retrieval (IR). The PhD
project investigates the dynamic construction of Formal Concept
Analysis (FCA) lattices in response to user queries, for the purpose
of retrieving documents from large, non-domain-specific
text collections. We conjecture that FCA lattices enable
efficient, effective retrieval of documents from such
collections, enhancing retrieval performance. This is
motivated by potential benefits of lattices to IR such as:
the lattice information space naturally incorporates the
navigational as well as querying spaces; there are multiple
paths to a given document. The project seeks practical
solutions to problems due to size and lack of domain-specificity
of the collection. The implemented FCA-IR system will be tested
against more conventional IR systems in a user evaluation study.
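To illustrate what the nodes of such a lattice are, the sketch below enumerates the formal concepts (pairs of a document set and the set of terms those documents share) of a tiny document-term context. It is a brute-force illustration under assumptions: the function name formal_concepts and the three-document example are hypothetical, and the enumeration is exponential in the number of terms, so it says nothing about how the project builds lattices for large collections.

```python
from itertools import combinations


def formal_concepts(context):
    """Enumerate formal concepts of a context {document id: set of index terms}.

    Returns a set of (extent, intent) pairs, both frozensets.
    """
    docs = set(context)
    terms = set().union(*context.values())

    def extent(term_set):
        # All documents that contain every term in term_set.
        return frozenset(d for d in docs if term_set <= context[d])

    def intent(doc_set):
        # All terms shared by every document in doc_set.
        if not doc_set:
            return frozenset(terms)
        return frozenset(set.intersection(*(context[d] for d in doc_set)))

    concepts = set()
    for size in range(len(terms) + 1):
        for combo in combinations(sorted(terms), size):
            e = extent(set(combo))
            concepts.add((e, intent(e)))
    return concepts


if __name__ == "__main__":
    context = {
        "d1": {"lattice", "retrieval"},
        "d2": {"retrieval", "query"},
        "d3": {"lattice", "retrieval", "query"},
    }
    for ext, itt in sorted(formal_concepts(context), key=lambda c: -len(c[0])):
        print(sorted(ext), sorted(itt))
```

Ordering these concepts by inclusion of their document sets yields the lattice; the multiple paths to a document mentioned above correspond to the different chains of concepts that lead down to it.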
Rasmus Knappe, Roskilde University, Denmark.
Title: Ontology-based Similarity Measures
The main topic is similarity measures in connection with content-based
query evaluation. The aim is to use the knowledge from a domain-specific
ontology covering the domain of a given information base to obtain better
and closer answers on a semantic basis. We seek to achieve this by devising
a similarity measure that utilizes the different relations and structures
of the ontology to calculate the similarity between a query and objects in
the information base.
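The measure itself is not specified here. Purely as an illustration of the general idea, the sketch below computes a simple path-based similarity over an is-a hierarchy, where concepts that are fewer steps apart score higher. The function names, the 1 / (1 + distance) formula and the toy hierarchy are assumptions, and only one relation type is used, whereas the project aims to exploit several.

```python
def ancestors(concept, parents):
    """Map each ancestor of `concept` (including itself) to its distance in is-a steps."""
    dist = {concept: 0}
    frontier = [concept]
    while frontier:
        next_frontier = []
        for current in frontier:
            for parent in parents.get(current, ()):
                if parent not in dist:
                    dist[parent] = dist[current] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return dist


def similarity(a, b, parents):
    """Return 1 / (1 + shortest is-a path between a and b via a common ancestor)."""
    dist_a = ancestors(a, parents)
    dist_b = ancestors(b, parents)
    common = set(dist_a) & set(dist_b)
    if not common:
        return 0.0
    shortest = min(dist_a[c] + dist_b[c] for c in common)
    return 1.0 / (1.0 + shortest)


if __name__ == "__main__":
    # Hypothetical toy hierarchy: child concept -> list of parent concepts.
    parents = {"poodle": ["dog"], "dog": ["animal"], "cat": ["animal"]}
    print(similarity("poodle", "dog", parents))
    print(similarity("poodle", "cat", parents))
```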
Sotiris Rompas, University Of Glasgow, Scotland, UK.
Title: Visualization for web post-retrieval clustering
As the World Wide Web has grown beyond all expectations, to currently 5 billion
pages (Lawrence & Giles 1998), and continues to grow rapidly, extensive
research has been carried out to identify effective ways of locating
information on the Web. Search engines and directories
such as Google and AltaVista were created in order to aid users with their
web retrieval tasks. As things stand, users receive long ranked lists of web
documents as a result. Identifying the appropriate
information becomes a tedious task, as the user has to browse most of
the results returned in order to find anything relevant.
Sophisticated information retrieval algorithms have been developed in
order to increase the efficiency of the search engines (in terms of precision
and recall). Furthermore, clustering techniques have been implemented
in various search engines such as Vivissimo (www.vivissimo.com) and Dogpile
(www.dogpile.com) to increase information retrieval efficiency by
presenting meaningful groups of information (clusters) to the user, so that
appropriate information can be located more quickly. Extensive
research has been carried out on clustering algorithms for web post-retrieval
in order to increase the efficiency of the clustering. Furthermore, increased
effort has been devoted to applying various visualization techniques
that could visualize the information space of a retrieval task. These
visualizations fail to satisfy users in terms of browsing, as the
information space of such a task usually contains a very large number of documents.
We are taking a different approach by trying to combine both worlds (IR
and Information Visualization) and visualize the clusters of a post-retrieval
clustering task instead of visualizing the retrieved information space
as a whole. We hypothesize that such a visualization will increase
the overall efficiency of the search process and reduce search
time. Our visualization interface, VisOC, creates a simple graphical
representation of the clusters generated for a specific retrieval task
in such a way that the user can easily access every cluster generated
without being “lost” in the retrieved information space.
The main aim of VisOC is to minimize user clicks and also increase the
efficiency of the search process by giving the user a “picture”
of the available information for her query. The user can access a cluster
by clicking on it in order to reach its sub-clusters. At the lowest level
of the clustering hierarchy, simple icons are used to represent the retrieved
documents in order to make the interface more familiar.
At any time the user can view the entire information space without the
need of a mouse click.
Thomas Terney, Roskilde University, Denmark.
Till C. Lech, CognIT as / University of Bergen, Norway.
Title: Ontology-based Co-reference Chaining
The proposed PhD project is a part of the KunDoc research project, funded
by the Norwegian Research Council (KUNSTI-framework). KunDoc is a cross-disciplinary
project, aiming at combining language technology with methods related
to knowledge-based systems and the semantic web.
The goal of the proposed work is to develop a method for knowledge-intensive
co-reference chaining by means of domain-specific ontologies encoded in
RDFS.
There have been some knowledge-based approaches to co-reference chaining,
such as Wilks’ preference semantics, Schank’s scripts/frames
or semantic networks. However, the lack of available world knowledge has
been pointed out as a major drawback with these kinds of methods. In recent
years methods and tools for knowledge-based systems in general, and semantic
web initiatives in particular, have made major progress, providing powerful
representation languages and concept extraction tools.
The proposed approach aims at transferring results from a linguistic
and statistical corpus analysis into explicit domain-specific knowledge
represented in RDFS. Special focus is directed towards how predicate-argument
structures in sentences can be mapped into concepts and relations in the
domain ontology, which in turn will deliver semantic information
for the analysis of unknown documents from the same domain.
Methods and Tools
The work is planned to be carried out in three phases. In the first phase,
a domain-specific corpus will be processed, and the predominating concepts
will be extracted. In the second phase, the predicates employed by the domain
concepts will be statistically evaluated in order to cluster the concepts
semantically and store the concept-predicate relations in the domain ontology.
In the third phase, the knowledge stored in the ontology will be used
for the analysis of unknown texts. If an NP’s referent in a text
is unclear, it will be – along with its predicate – fed into
the ontology in order to find an appropriate referent.
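The lookup in the final phase can be pictured with the sketch below. It is a simplification under stated assumptions: a plain dictionary stands in for the RDFS ontology, and the table CONCEPT_PREDICATES, the function resolve and the example entries are all hypothetical.

```python
# Hypothetical concept-predicate relations, standing in for what the
# domain ontology would store after the corpus analysis.
CONCEPT_PREDICATES = {
    "contract": {"sign", "terminate", "renew"},
    "employee": {"hire", "dismiss", "sign"},
    "salary": {"pay", "raise"},
}


def resolve(predicate, candidates, concept_predicates=CONCEPT_PREDICATES):
    """Return the candidate referents whose concept is known to occur with the predicate.

    `candidates` are the concepts of possible antecedents, in the order the
    caller prefers them (for example, most recent mention first).
    """
    return [c for c in candidates if predicate in concept_predicates.get(c, set())]


if __name__ == "__main__":
    # "The employee received the contract. It was renewed in May."
    print(resolve("renew", ["contract", "employee"]))
```

When several candidates survive the filter, as with "sign" in this toy table, some further criterion would be needed to choose among them.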
In order to extract the concepts, the Onto-Extract tool made by CognIT
as in the On-To-Knowledge Project will be used, together with the Oslo-Bergen
Tagger, a rule-based PoS-Tagger developed at the University of Oslo and
AKSIS, Bergen. The predicate-argument analysis will be executed by means
of the NORGRAM parser, developed at the University of Bergen.