PAPERS & PUBS

Find out more about the science related to our business. View abstracts of publications co-authored by Umbria scientists.

Weblog Classification for Fast Splog Filtering: A URL Language Model Segmentation Approach
Salvetti, F. and Nicolov, N.
From Proceedings of HLT-NAACL 2006: Human Language Technology Conference, New York City, NY, USA, 2006.

Scientific Paper Abstract. This paper shows that in the context of statistical weblog classification for splog filtering based on n-grams of tokens in the URL, further segmenting the URLs beyond the standard punctuation is helpful. Many splog URLs contain phrases in which the words are glued together in order to avoid splog filtering techniques based on punctuation segmentation and unigrams. A technique which segments long tokens into the words forming the phrase is proposed and evaluated. The resulting tokens are used as features for a weblog classifier whose accuracy is similar to that of humans (78% vs. 76%) and which reaches 93.3% precision in identifying splogs with a recall of 50.9%.
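
The idea of segmenting glued URL tokens can be illustrated with a minimal sketch. It is not the language-model segmenter evaluated in the paper: it assumes a small hypothetical vocabulary (VOCAB) and simply looks for the split with the fewest vocabulary words, leaving a token whole when no complete split exists. The function names are illustrative only.

```python
import re

# Hypothetical toy vocabulary (an assumption for illustration); a real system
# would score candidate words with statistics estimated from data.
VOCAB = {"cheap", "car", "insurance", "quotes", "online", "blog", "free"}

def split_on_punctuation(url):
    """Standard segmentation: split the URL on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

def segment_token(token, vocab=VOCAB):
    """Split a glued token into the fewest vocabulary words, or return it whole."""
    n = len(token)
    best = [None] * (n + 1)  # best[i] = fewest-word segmentation of token[:i]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n] if best[n] is not None else [token]

def url_features(url):
    """Punctuation split followed by further segmentation of each token."""
    tokens = []
    for token in split_on_punctuation(url):
        tokens.extend(segment_token(token))
    return tokens

print(url_features("http://cheapcarinsurance.blogspot.com/"))
```

With punctuation splitting alone, "cheapcarinsurance" would remain a single rare token; after segmentation its component words become separate features that a classifier can count.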



Local Flow Betweenness Centrality for Clustering Community Graphs
Salvetti, F. and Srinivasan, S.
In Proceedings of WINE 2005, 1st Workshop on Internet and Network Economics, Hong Kong. Springer, Lecture Notes in Computer Science.

Scientific Paper Abstract. The problem of information flow is studied to identify de facto communities of practice from tacit knowledge sources that reflect the underlying community structure, using a collection of instant message logs. We characterize and model the community detection problem using a combination of graph theory and ideas of centrality from social network analysis. We propose, validate, and develop a novel algorithm to detect communities based on computation of the Local Flow Betweenness Centrality (LFBC). Using LFBC, we model the weights on the edges in the graph so we can extract communities. We also present how to compute LFBC efficiently on relevant edges without having to recalculate the measure for each edge in the graph during the process. We validate our algorithms on a corpus of instant messages that we call MLog. Our results demonstrate that MLogs are a useful source for community detection that can augment the study of collaborative behavior.
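
The abstract does not define LFBC, so the sketch below does not implement it. As a stand-in, it shows the general betweenness-based approach to community extraction using standard shortest-path edge betweenness (in the style of Girvan and Newman): the edge carrying the most shortest paths is removed repeatedly until the graph splits. It recomputes the measure for every edge after each removal, which is exactly the cost the paper says its method avoids. All function names are hypothetical, and the graph is assumed to be undirected, unweighted, and free of self-loops.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path edge betweenness via breadth-first search from every node."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        dist, paths = {source: 0}, {source: 1}
        preds = {v: [] for v in adj}
        order, queue = [], deque([source])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:          # first time v is reached
                    dist[v] = dist[u] + 1
                    paths[v] = 0
                    queue.append(v)
                if dist[v] == dist[u] + 1:  # u lies on a shortest path to v
                    paths[v] += paths[u]
                    preds[v].append(u)
        credit = {v: 0.0 for v in order}
        for v in reversed(order):           # accumulate from the farthest nodes back
            for u in preds[v]:
                share = paths[u] / paths[v] * (1.0 + credit[v])
                score[frozenset((u, v))] += share
                credit[u] += share
    return score  # every node pair is counted once from each endpoint

def components(adj):
    """Connected components of the graph as a list of node sets."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in part:
                part.add(u)
                stack.extend(adj[u] - part)
        seen |= part
        parts.append(part)
    return parts

def split_communities(edges, target=2):
    """Remove the highest-betweenness edge until `target` components exist."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    while len(components(adj)) < target:
        score = edge_betweenness(adj)
        if not score:
            break
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles joined by the single bridge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
print(split_communities(edges))
```

In an instant-message setting, nodes would be users and an edge would link two users who exchanged messages; the bridge edge is removed first because every path between the two groups crosses it.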



Opinion Polarity Identification of Movie Reviews
Salvetti, F., Reichenbach, C. and Lewis, S.
To appear in Computing Attitude and Affect in Text, Springer, Dordrecht, The Netherlands, 2005

Scientific Paper Abstract. One approach to the assessment of overall opinion polarity (OvOP) of reviews, a concept defined in this paper, is the use of supervised machine learning mechanisms. In this paper, the impact of lexical feature selection and feature generalization, applied to reviews, on the precision of two probabilistic classifiers (Naïve Bayes and Markov Model) with respect to OvOP identification is observed. Feature generalization based on hypernymy as provided by WordNet, and feature selection based on part-of-speech (POS) tags are evaluated. A ranking criterion is introduced, based on a function of the probability of having positive or negative polarity, which makes it possible to achieve 100% precision with 10% recall. Movie reviews are used for training and testing the probabilistic classifiers, which achieve 80% precision.
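
The trade-off behind the ranking criterion can be sketched as follows. The abstract does not give the ranking function, so this example assumes a simple one: the distance of the predicted probability of positive polarity from 0.5. It also assumes that the classifier only commits to the most confident fraction of reviews. The data are invented toy values for illustration, not results from the paper, and the function name is hypothetical.

```python
def precision_at_coverage(examples, coverage):
    """Precision on the most confident fraction `coverage` of the examples.

    examples: list of (p_positive, true_label), true_label in {"pos", "neg"}.
    Confidence is assumed to be |p_positive - 0.5| (an illustrative choice).
    """
    if not examples:
        raise ValueError("examples must not be empty")
    ranked = sorted(examples, key=lambda e: abs(e[0] - 0.5), reverse=True)
    keep = max(1, round(coverage * len(ranked)))
    kept = ranked[:keep]
    correct = sum(1 for p, label in kept
                  if ("pos" if p >= 0.5 else "neg") == label)
    return correct / len(kept)

# Invented toy predictions: (probability of positive polarity, true label).
examples = [
    (0.99, "pos"), (0.02, "neg"), (0.90, "pos"), (0.65, "neg"), (0.45, "pos"),
    (0.55, "pos"), (0.30, "neg"), (0.52, "neg"), (0.80, "pos"), (0.40, "neg"),
]

print(precision_at_coverage(examples, 0.2))  # only the most confident reviews
print(precision_at_coverage(examples, 1.0))  # every review
```

Restricting the output to the highest-ranked reviews raises precision at the cost of leaving most reviews unclassified, which is the kind of trade-off the abstract reports.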



Current Trends and Techniques in Temporal Analysis
Boguraev, B. and Nicolov, N.
EUROLAN'2005

Tutorial Abstract. As more natural language processing (NLP) applications are looking to incorporate some form of temporal reasoning, computational analysis of time is becoming a prominent research topic. Temporal analysis, however, requires much more than identifying temporal expressions in text: time structures are considerably more complex than the entities typically at the focus of traditional information extraction (IE) work. This is not surprising, as reasoning is a more demanding operation than, e.g., template filling or gisting, but it introduces additional challenges at representational and analytical levels. While many of the lessons learnt while solving traditional IE problems are also applicable to (IE-like) aspects of temporal analysis, there are tasks in that space which require novel approaches and solutions. We will look at existing, and evolving, representational devices for computationally modeling time; we will relate these to a broad range of annotation schemes; we will consider challenges facing both human and computer annotators; and we will present a number of computational (algorithmic) strategies for temporal analysis of text documents.



Information Flow using Edge Stress Factor
Salvetti, F. and Srinivasan, S.
WWW 2005, Special interest tracks and posters, Chiba, Japan, 2005

Poster Abstract. This paper shows how a corpus of instant messages can be employed to detect de facto communities of practice automatically. A novel algorithm based on the concept of Edge Stress Factor is proposed and validated. Results show that this approach is fast and effective in studying collaborative behavior.



A Statistical Model for Multilingual Entity Detection and Tracking
Florian, R., Hassan, H., Ittycheriah, A., Jing, H., Kambhatla, N., Luo, X., Nicolov, N. and Roukos, S.
HLT-NAACL 2004: Human Language Technology conference / Annual meeting of the North American chapter of the Association for Computational Linguistics, Boston, Mass., 2004

Scientific Paper Abstract. Entity detection and tracking is a relatively new addition to the repertoire of natural language tasks. In this paper, we present a statistical language-independent framework for identifying and tracking named, nominal and pronominal references to entities within unrestricted text documents, and chaining them into clusters corresponding to each logical entity present in the text. Both the mention detection model and the novel entity tracking model can use arbitrary feature types, being able to integrate a wide array of lexical, syntactic and semantic features. In addition, the mention detection model crucially uses feature streams derived from different named entity classifiers. The proposed framework is evaluated with several experiments run in Arabic, Chinese and English texts; a system based on the approach described here and submitted to the latest Automatic Content Extraction (ACE) evaluation achieved top-tier results in all three evaluation languages.



Impact of Lexical Filtering on Overall Opinion Polarity Identification
Salvetti, F., Reichenbach, C. and Lewis, S.
Proceedings of AAAI Spring Symposium on Exploring Attitude and Affect in Text, 2004

Scientific Paper Abstract. One approach to assessing overall opinion polarity (OvOP) of reviews, a concept defined in this paper, is the use of supervised machine learning mechanisms. In this paper, the impact of lexical filtering, applied to reviews, on the accuracy of two statistical classifiers (Naive Bayes and Markov Model) with respect to OvOP identification is observed. Two kinds of lexical filters, one based on hypernymy as provided by WordNet, and one hand-crafted filter based on part-of-speech (POS) tags, are evaluated. A ranking criterion based on a function of the probability of having positive or negative polarity is introduced and verified as being capable of achieving 100% accuracy with 10% recall. Movie reviews are used for training and evaluation of each statistical classifier, achieving 80% accuracy.






 



© 2008 J.D. Power and Associates, The McGraw-Hill Companies, Inc. All rights reserved.