


PAPERS & PUBS
Find out more about the science related to our business. View
abstracts of publications co-authored by Umbria scientists.
Weblog Classification for Fast Splog Filtering: A URL Language Model Segmentation Approach
Salvetti, F. and Nicolov, N.
To appear in Proceedings of HLT-NAACL 2006: Human Language
Technology Conference, New York City, NY, USA, 2006.
Scientific Paper Abstract. This paper shows that in the context
of statistical weblog classification for splog filtering based
on n-grams of tokens in the URL, further segmenting the URLs
beyond the standard punctuation is helpful. Many splog URLs
contain phrases in which the words are glued together in order
to avoid splog filtering techniques based on punctuation segmentation
and unigrams. A technique which segments long tokens into
the words forming the phrase is proposed and evaluated. The
resulting tokens are used as features for a weblog classifier
whose accuracy is similar to that of humans (78% vs. 76%)
and which reaches 93.3% precision in identifying splogs at a
recall of 50.9%.
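As a rough illustration of the segmentation idea in this abstract, the sketch below splits a glued URL token into words with a dynamic program over a unigram model. The abstract does not give the paper's actual segmentation method, so the function name `segment`, the `VOCAB` table, and its log-probabilities are all hypothetical and chosen only for the example.

```python
import math

def segment(token, word_logprob, max_len=20, unknown_logprob=-20.0):
    """Split a glued token into its most probable word sequence (unigram model)."""
    n = len(token)
    # best[i] holds (score, split point) for the best segmentation of token[:i].
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = token[start:end]
            # Unknown pieces pay a per-character penalty so known words win.
            score = word_logprob.get(piece, unknown_logprob * len(piece))
            total = best[start][0] + score
            if total > best[end][0]:
                best[end] = (total, start)
    # Walk the back-pointers to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(token[start:end])
        end = start
    return words[::-1]

# Hypothetical toy vocabulary with made-up log-probabilities.
VOCAB = {"cheap": -3.0, "car": -3.0, "insurance": -4.0,
         "quotes": -4.0, "in": -2.0, "sure": -3.5}

print(segment("cheapcarinsurance", VOCAB))
```

The recovered words could then serve as n-gram features for a classifier, in the spirit of the approach the abstract describes.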
Local Flow Betweenness Centrality for Clustering
Community Graphs
Salvetti, F. and Srinivasan, S.
In Proceedings of WINE 2005, 1st Workshop on Internet and
Network Economics, Hong Kong. Springer, Lecture Notes in Computer
Science.
Scientific Paper Abstract. The problem of information flow
is studied to identify de facto communities of practice from
tacit knowledge sources that reflect the underlying community
structure, using a collection of instant message logs. We
characterize and model the community detection problem using
a combination of graph theory and ideas of centrality from
social network analysis. We propose, validate, and develop
a novel algorithm to detect communities based on computation
of the Local Flow Betweenness Centrality. Using LFBC, we model
the weights on the edges in the graph so we can extract communities.
We also show how to compute LFBC efficiently on relevant
edges without having to recalculate the measure for each edge
in the graph during the process. We validate our algorithms
on a corpus of instant messages that we call MLog. Our results
demonstrate that MLogs are a useful source for community detection
that can augment the study of collaborative behavior.
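The abstract does not define Local Flow Betweenness Centrality, so the sketch below uses a stand-in: standard shortest-path edge betweenness on an unweighted, undirected graph, computed with Brandes-style accumulation. It only illustrates how an edge score can expose the links between communities. The `MESSAGES` graph is a made-up example in which an edge means two people exchanged instant messages.

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted, undirected graph."""
    score = defaultdict(float)
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back toward s.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Each unordered pair of endpoints was counted once from each side.
    return {edge: value / 2.0 for edge, value in score.items()}

# Hypothetical graph: two triangles of users joined by the single link c-d.
MESSAGES = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}

scores = edge_betweenness(MESSAGES)
bridge = max(scores, key=scores.get)
print(sorted(bridge), scores[bridge])
```

Removing the highest-scoring edge splits this toy graph into its two triangles. Note that this global measure is recomputed over the whole graph, whereas the abstract describes computing LFBC only on relevant edges.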
Opinion Polarity Identification of Movie Reviews
Salvetti, F., Reichenbach, C. and Lewis, S.
To appear in Computing Attitude and Affect in Text, Springer,
Dordrecht, The Netherlands, 2005
Scientific Paper Abstract. One approach to the assessment
of overall opinion polarity (OvOP) of reviews, a concept defined
in this paper, is the use of supervised machine learning mechanisms.
In this paper, the impact of lexical feature selection and
feature generalization, applied to reviews, on the precision
of two probabilistic classifiers (Naïve Bayes and Markov
Model) with respect to OvOP identification is observed. Feature
generalization based on hypernymy as provided by WordNet,
and feature selection based on part-of-speech (POS) tags are
evaluated. A ranking criterion is introduced, based on a function
of the probability of having positive or negative polarity,
which makes it possible to achieve 100% precision with 10%
recall. Movie reviews are used for training and testing the
probabilistic classifiers, which achieve 80% precision.
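The ranking criterion can be pictured with a small sketch. The abstract does not give the ranking function, so this example assumes the absolute log-odds of the positive class as the confidence score and labels only the most confident fraction of reviews. The names `rank_by_confidence`, `precision_at_coverage`, and the data in `P_POS` and `GOLD` are hypothetical.

```python
import math

def rank_by_confidence(probabilities):
    """Order item indices from most to least confident, scored by |log-odds|."""
    def confidence(p):
        p = min(max(p, 1e-9), 1 - 1e-9)  # clamp to avoid log(0)
        return abs(math.log(p / (1 - p)))
    return sorted(range(len(probabilities)),
                  key=lambda i: confidence(probabilities[i]),
                  reverse=True)

def precision_at_coverage(probabilities, labels, coverage):
    """Label only the top `coverage` fraction of items; return precision on them."""
    ranked = rank_by_confidence(probabilities)
    keep = ranked[: max(1, round(coverage * len(ranked)))]
    correct = sum((probabilities[i] >= 0.5) == labels[i] for i in keep)
    return correct / len(keep)

# Made-up P(positive) values and gold labels (True = positive review).
P_POS = [0.99, 0.02, 0.55, 0.45, 0.80, 0.30, 0.60, 0.40, 0.52, 0.05]
GOLD = [True, False, False, True, True, False, True, True, False, False]

for coverage in (0.1, 0.5, 1.0):
    print(coverage, precision_at_coverage(P_POS, GOLD, coverage))
```

On this toy data, precision is highest when only the most confident reviews are labeled and drops as coverage grows, which is the trade-off the abstract reports between precision and recall.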
Current Trends and Techniques in Temporal Analysis
Boguraev, B. and Nicolov, N.
EUROLAN'2005
Tutorial Abstract. As more natural language processing (NLP)
applications are looking to incorporate some form of temporal
reasoning, computational analysis of time is becoming a prominent
research topic. Temporal analysis, however, requires much
more than identifying temporal expressions in text: time structures
are considerably more complex than entities typically at the
focus of traditional information extraction (IE) work. This
is not surprising, as reasoning is a more demanding operation
than, e.g., template filling or gisting, but it introduces
additional challenges at representational and analytical levels.
While many of the lessons learnt while solving traditional
IE problems are also applicable to (IE-like) aspects of temporal
analysis, there are tasks in that space which require novel
approaches and solutions. We will look at existing, and evolving,
representational devices for computationally modeling time;
we will relate these to a broad range of annotation schemes;
we will consider challenges facing both human and computer
annotators; and we will present a number of computational
(algorithmic) strategies for temporal analysis of text documents.
Information Flow using Edge Stress Factor
Salvetti, F. and Srinivasan, S.
WWW 2005, Special interest tracks and posters, Chiba, Japan,
2005
Poster Abstract. This paper shows how a corpus of instant
messages can be employed to detect de facto communities of
practice automatically. A novel algorithm based on the concept
of Edge Stress Factor is proposed and validated. Results show
that this approach is fast and effective in studying collaborative
behavior.
A Statistical Model for Multilingual Entity Detection
and Tracking
Florian, R., Hassan, H., Ittycheriah, A., Jing, H., Kambhatla,
N., Luo, X., Nicolov, N. and Roukos, S.
HLT-NAACL 2004: Human Language Technology conference / Annual
meeting of the North American chapter of the Association for
Computational Linguistics, Boston, Mass., 2004
Scientific Paper Abstract. Entity detection and tracking is
a relatively new addition to the repertoire of natural language
tasks. In this paper, we present a statistical language-independent
framework for identifying and tracking named, nominal and
pronominal references to entities within unrestricted text
documents, and chaining them into clusters corresponding to
each logical entity present in the text. Both the mention
detection model and the novel entity tracking model can use
arbitrary feature types, being able to integrate a wide array
of lexical, syntactic and semantic features. In addition,
the mention detection model crucially uses feature streams
derived from different named entity classifiers. The proposed
framework is evaluated with several experiments run in Arabic,
Chinese and English texts; a system based on the approach
described here and submitted to the latest Automatic Content
Extraction (ACE) evaluation achieved top-tier results in all
three evaluation languages.
Impact of Lexical Filtering on Overall Opinion
Polarity Identification
Salvetti, F., Reichenbach, C. and Lewis, S.
Proceedings of AAAI Spring Symposium on Exploring Attitude
and Affect in Text, 2004
Scientific Paper Abstract. One approach to assessing overall
opinion polarity (OvOP) of reviews, a concept defined in
this paper, is the use of supervised machine learning mechanisms.
In this paper, the impact of lexical filtering, applied
to reviews, on the accuracy of two statistical classifiers
(Naive Bayes and Markov Model) with respect to OvOP identification
is observed. Two kinds of lexical filters, one based on
hypernymy as provided by WordNet, and one hand-crafted filter
based on part-of-speech (POS) tags, are evaluated. A ranking
criterion based on a function of the probability of having
positive or negative polarity is introduced and verified
as being capable of achieving 100% accuracy with 10% recall.
Movie reviews are used for training and evaluation of each
statistical classifier, achieving 80% accuracy.