1 monitoring message streams: algorithmic methods for automatic processing of messages fred roberts,...

1

Monitoring Message Streams:

Algorithmic Methods for Automatic

Processing of MessagesFred Roberts, Rutgers University

2

Motivation: monitoring email traffic, news, communiques, faxes, voice intercepts (with speech recognition)

MMS: Goal

Monitor huge communication streams, in particular, streams of textualized communication to automatically detect pattern changes and "significant" events

3

• Synergistic improvements in- Performance in terms of space, time,

effectiveness, and/or “insight”- Understanding the tradeoffs among these types of

improvements- Compression for efficient resource use- Representation that aids fitting models- Efficient matching of text to text and model to

text- Learning models from data and prior knowledge- Reduction in need for large amounts of training

data or labor-intensive input- Fusion of complementary filtering approaches

MMS: Overall Objectives

4

• Emphasis on “Supervised” Filtering:

- Given example documents, textbook descriptions, etc., find documents on this topic in incoming stream or past data

• Less Emphasis on “Unsupervised” Event Identification:

- Detect emergent characteristics, anomalous patterns, etc. in incoming stream of text or historical statistics on the stream

MMS: Approaches

5

• Batch filtering: All training texts processed before any texts of active interest to user

• Adaptive filtering: User trains system during use- Value of examples

for both information and training must be considered

MMS Approaches: Supervised Filtering

6

• Creating summary statistics on massive data streams

- Detect outliers, heavy hitters (most frequent items) , etc.

- Allow us to return to past without keeping raw data

• Reducing need for labeled training examples in supervised classification

- Bayesian priors from domain knowledge

- Tuning on unlabeled data

MMS Approaches: Dealing with Massive Data

7

Bayesian Logistic Regression• Using sparseness-favoring priors, our methods

have produced outstanding accuracy and fast predictions with no ad-hoc feature selection

• State of art text classification effectiveness - Recently: Highest score on TREC 2004 triage

task• Public release of our Bayesian Binary Regression

(BBR) software (500 downloads)

Accomplishments Phase II (Jan ‘04 – Sep ‘04)

Thomas Bayes

8

Bayesian Logistic Regression (cont’d)• Ability to use domain knowledge to set prior

distributions led to large improvements in effectiveness when little training data is available

• New online algorithms: online updating of Bayesian models as new data become available


9

Streaming Algorithms • New sketch-based algorithms for

detecting word frequency changes and other patterns in massive text streams

• Rapid methods for finding changing trends, outliers and deviants, rare events, heavy hitters

• Initial results using summarized data to search for meaningful answers to queries about the past

• Initial work on textual and structural patterns in informal communication networks


101

110

1

00

1

1

10

Nearest neighbor classification: Fast implementation

• Continued development of heuristics for approximate neighbor finding with an in-memory inverted index

• Our results have reduced memory by 90% and time by 90 to 99% with minimal impact on effectiveness.

• Packaged and delivered kNN software• Developing algorithms for speeding up slow but

potentially highly effective “local learning” approach- Based on training a separate logistic regression on

the neighbors of each test document!- Slow, but with many avenues to large speedups

AccomplishmentsPhase II (Jan ‘04 – Sep ‘04)

11

Adaptive Filtering• Models to Aid in Learning: When to act

greedily (“exploit” -- submit documents we believe relevant) and when to take risks

(“explore” -- submit documents that can be irrelevant)• Seek approximate solutions to the

intractable optimal exploration/exploitation tradeoff

• Experiments show slight improvements in filtering effectiveness compared to greedy (exploit-only) approach

AccomplishmentsPhase II (Jan ‘04 – Sep ‘04)

12

• Bayesian methods assume prior beliefs about parameters before data is seen

- Project Phase I: generic, vague priors- Project Phase II: Reference materials or

intuitions about words may help predict class. Use these to set priors. (Material very unlike training examples)

• Goal: reduce need for training examples- Replace 1000’s of randomly sampled

examples with few, possibly biased examples

Some MMS Work in Depth: Bayesian Priors from Domain Knowledge

13

• Reference texts have some non-topical words

- Use words that discriminate among topics (use Inverted Document Frequency (IDF) weighting within reference collection)

• Small training sets increase problems with thresholding and text representation

- Use unlabeled data to aid thresholding and to learn IDF weights

- Use separate prior for intercept term of model

Knowledge-Driven Priors: Issues

14

• Topics: 27 Reuters Region categories

• Knowledge: CIA World Factbook (WFB) entries

• Examples: 10/topic

• Baseline results (F1 measure):

• WFB: 0.234 , no WFB: 0.052

• Better small training sets, improved algorithms

• WFB: 0.591, no WFB: 0.395

Knowledge-Driven Priors: Results

15

• Reference materials of text type very different from documents to be classified can aid supervised filtering

- In combination with tuning on unlabeled data, this technique can provide immediate practical benefits

• Current methods are crude and ad hoc – substantial improvements should be possible

Knowledge-Driven Priors: Summary

16

Some MMS Work in Depth: Streaming Analysis

• Problem: Monitor fast, massive text streams and support both online tracking as well as historic analysis for events.

• Multidimensional data: source, destination, time sent or received, metadata (reply, language), text

labels (words, phrases), links.• Goal: To use highly compact summaries

that are computed at stream speed and perform accurate analyses.

17

Streaming Analysis Tool: CM Sketch• Theoretical: We have developed the CM Sketch that

uses (1/) log 1/ space to approximate data distribution with error at most , and probability of success at least 1-. – All other previously known sample or sketch

methods use space at least (1/).– CM Sketch is an order of magnitude better.

• Practical: Few 10's of KBs gives accurate summary of large data: Create summaries of data that allow historic queries to find

– Heavy Hitters (Most Frequent Items)

– Quantiles of a Distribution (Median, Percentiles etc.)

– Finding items with large changes

18

Streaming Analysis: Using Web Logs• Web logs (blogs) or regularly updated on-line

journals provide informal, opinionated, candid data that is more like email than is the web.

• We have begun to automatically collect blogs, stripping formatting and tags, ads, etc., and outputting corresponding "bag of words" into streaming algorithms for analysis, archiving. 10’s to 100 GB scale.

• 3000:1 compression using CM Sketch methods.

• Allows accurate analysis of popular words, new emergent words, etc., including multilingual occurrences.

19

• Classic method: Rocchio • Classic method: Centroid• kNN with IFH (inverted file

heuristic)• Sparse Bayesian (Bayesian with

Laplace priors)• Combinatorial PCA• Homotopic Linking of Widely

Varying Rocchio Methods• aiSVM• Fusion

Deliverables: Phase I

20

• Revised and extended version of kNN code, including scripts for running local learning experiments

• Substantially extended version of BBR, including use of domain knowledge to set priors

• CM Sketch (C library for count-min sketching)

• Code to use CM Sketch to find heavy hitters, quantiles, and large changes in streams

Deliverables: Phase II

21

Bayesian• Expand types of domain knowledge usable

- For instance, making use of the taxonomies available in many subject areas

• Improve self-tuning of BBR software- Make it more effective for novice users- Surprisingly subtle questions: Cross-

validation, calibration, scaling (e.g., when multiple features)

• Incorporate previous work on online Bayesian methods into BBR

MMS: Future Directions

22

Streaming• Systematically explore summarization methods such

as sampling, bitmaps, sketches- Develop warehousing techniques for large scale

sketch-based historical analyses• Massiveness of data implies linear algorithms too

inefficient. Seek sublinear methods.• Develop sketch-based methods for link analysis in

temporally changing multigraphs - From and To addresses in email, links between

blogs, etc. • Add modeling component to the sketch-based

analysis: Exploit knowledge of distribution of the data.


23

kNN• kNN with small training sample for each of massive

number of topics - maybe only 5 to 10 known

relevant/irrelevant documents- Since small samples have little overlap, extend kNN

approach to deal with partially labeled datasets• Bayesian kNN

- Incorporate methods developed in our Bayesian work for dealing with small training sets (e.g., tuning thresholds on unlabeled data).

- More fundamental combinations of Bayesian and kNN methods (e.g., tunable distance metrics)


24

Greedy Round Robin Feature Selection• In phase I work: Explored greedy heuristic to choose subset of original set of terms as features

- Did extremely well in TREC2002 “topic intersection tasks”

• Will develop a Greedy Round Robin (GRR) method- Applies if features fall into two or more

“conceptually distinct” sets (e.g., metadata such as source/destination, genre or medium of the message)

- Each list of features is consulted in turn.- Plan experimental analysis of GRR - Plan theoretical analysis of GRR using simulation


25

Adaptive filtering• Experiment with new adaptive thresholding

methods (synergy with Bayesian thresholding work)- Scoring threshold is adjusted downward if

judging too many irrelevant documents; upward if judging too few relevant documents

• Aim for algorithm with state-of-art effectiveness and provable theoretical properties

• Compare rate of convergence of various algorithms on real data.


26

Paul Kantor, Rutgers Communic., Info.& Library StudiesDave Lewis, ConsultantMichael Littman, Rutgers CSDavid Madigan, Rutgers StatisticsS. Muthukrishnan, Rutgers CSRafail Ostrovsky, Telcordia/UCLAFred Roberts, Rutgers DIMACS/MathMartin Strauss, AT&T Labs/U. Michigan)Wen-Hua Ju, Avaya Labs (collaborator)Andrei Anghelescu, Graduate StudentSuhrid Balakrishnan, Graduate StudentAynur Dayanik, Graduate StudentDmitry Fradkin, Graduate StudentPeng Song, Graduate StudentGraham Cormode, postdocAlex Genkin, software developerVladimir Menkov, software developer

MMS PROJECT TEAM:

1 monitoring message streams: algorithmic methods for automatic processing of messages fred roberts,...

Documents

massive data slide

new data

past data

massive data streams

supervised filtering

summarized data

raw data

little training data