Understanding the Chinese Firewall (continued)
Dr. Crandall, Leif, Tony, Ronnie, Veronika
Topics: review of the Chinese Firewall, Maximum Entropy, and Point Feature Comparison with Singular Value Decomposition
Chinese Firewall
We want to monitor which words and phrases are being censored in China.
We find out which words are being filtered by probing the "Chinese Firewall" with words that are likely to be censored.
Our main problem is finding the words that are likely to be censored.
Challenge: we are dealing with Chinese text, and Chinese characters are not like English letters.
Ex: 馬
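To make the challenge concrete, here is a minimal Python sketch of how a Chinese character differs from an English letter at the text-processing level. The helper name `describe` and the sample sentence are illustrative assumptions, not part of the project.

```python
def describe(ch):
    """Return the Unicode code point and UTF-8 bytes (hex) of one character."""
    return f"U+{ord(ch):04X}", ch.encode("utf-8").hex()

# The example character from the slide: a single character, three UTF-8 bytes.
code_point, utf8_hex = describe("馬")
print(code_point, utf8_hex)

# Assumed sample sentence (roughly "Chinese government passes new law").
# Chinese text has no spaces between words, so whitespace splitting
# returns the whole sentence as one token.
sentence = "中国政府通过新法律"
tokens = sentence.split()
print(len(tokens))
```

Because whitespace does not mark word boundaries, any keyword extraction step has to decide where words begin and end before it can label them.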
Maximum Entropy
Used for Named Entity Extraction.
Ex: "Chinese government passes new law"
Chinese [Beginning of Named Entity], government [End of Named Entity], passes [other], new [other], law [Unique Named Entity]
Build a model from a training set: our training set is the Chinese Wikipedia.
The training set needs to have a specific format: assign each word a set of features, and label each word as a [unique named entity], [other], etc.
Using Maximum Entropy, we can assign a probability P(named entity) to new words based on features describing those words.
Once we extract named entities from news sources, we can test whether new words are added to the "blacklist".
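The idea of assigning P(named entity) from word features can be sketched as a tiny conditional maximum entropy classifier, trained by stochastic gradient ascent on the log-likelihood. This is only an illustration: the feature names, the two-label set, the toy training examples, and the learning settings are all assumptions, and it is not the maximum entropy program used in the project.

```python
import math

LABELS = ["named_entity", "other"]  # assumed, simplified label set

def joint_features(word_features, label):
    """Pair each word feature with a candidate label (maxent feature functions)."""
    return [f"{f}&{label}" for f in word_features]

def probabilities(weights, word_features):
    """P(label | features) = exp(sum of active weights) / normalizer."""
    scores = [sum(weights.get(k, 0.0) for k in joint_features(word_features, y))
              for y in LABELS]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return {y: e / z for y, e in zip(LABELS, exps)}

def train(examples, epochs=200, rate=0.5):
    """Stochastic gradient ascent on the conditional log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for word_features, gold in examples:
            probs = probabilities(weights, word_features)
            for y in LABELS:
                target = 1.0 if y == gold else 0.0
                # Gradient: observed feature count minus expected count.
                for k in joint_features(word_features, y):
                    weights[k] = weights.get(k, 0.0) + rate * (target - probs[y])
    return weights

# Hypothetical labeled words, each described by a set of features.
examples = [
    (["suffix=政府", "in_gazetteer"], "named_entity"),
    (["in_gazetteer"], "named_entity"),
    (["is_verb"], "other"),
    (["is_function_word"], "other"),
]

weights = train(examples)
p_entity = probabilities(weights, ["in_gazetteer"])["named_entity"]
p_verb = probabilities(weights, ["is_verb"])["named_entity"]
print(round(p_entity, 3), round(p_verb, 3))
```

A word whose features were never seen in training gets equal probability for every label, which is the maximum entropy behavior when the model has no evidence.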
Problem: Chinese text that is similar, but not identical, to the keyword we want to test.
Ex: 法轮功 vs. 法十轮十功
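One simple way to see why exact matching fails on such variants is to check whether the keyword survives as an ordered subsequence of the altered text. The sketch below is a plain string-level illustration with an assumed helper name; it is not the image-based point feature approach the slides go on to describe.

```python
def contains_as_subsequence(keyword, text):
    """True if every character of keyword appears in text, in order."""
    remaining = iter(text)
    # 'ch in remaining' advances the iterator until it finds ch (or runs out),
    # so later characters must appear after earlier ones.
    return all(ch in remaining for ch in keyword)

keyword = "法轮功"
variant = "法十轮十功"

print(keyword in variant)                         # exact substring match
print(contains_as_subsequence(keyword, variant))  # order-preserving match
```

A subsequence test catches inserted filler characters, but not characters swapped for visually similar ones, which is one motivation for comparing character shapes instead.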
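Feature Correspondence by Singular Value Decomposition
Goal: a 1:1 mapping between the point features of two images, found using SVD.
Given the point features in two images I and J, build a proximity matrix G with $G_{ij} = \exp(-r_{ij}^2 / 2\sigma^2)$, where $r_{ij}$ is the distance between feature $I_i$ and feature $J_j$.
Take the SVD of G, $G = T D U'$, then replace D with a matrix E that has ones on its diagonal to get $P = T E U'$.
$P_{ij}$ determines whether $I_i$ maps to $J_j$: the two features are paired when $P_{ij}$ is the largest element in both its row and its column.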
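The pairing procedure above can be sketched with NumPy as follows. The function name, the toy point sets, and the choice of sigma are assumptions for illustration; a real pipeline would first need to extract the point features from the character images.

```python
import numpy as np

def svd_correspondence(points_i, points_j, sigma):
    """Pair point features of two images via the SVD of a proximity matrix."""
    a = np.asarray(points_i, dtype=float)
    b = np.asarray(points_j, dtype=float)
    # Squared distances r_ij^2 between every feature in I and every feature in J.
    diff = a[:, None, :] - b[None, :, :]
    r2 = np.sum(diff ** 2, axis=2)
    g = np.exp(-r2 / (2.0 * sigma ** 2))           # proximity matrix G
    t, _, u_t = np.linalg.svd(g, full_matrices=False)
    p = t @ u_t                                     # P = T E U' (singular values set to 1)
    matches = []
    for i in range(p.shape[0]):
        j = int(np.argmax(p[i]))
        # Accept the pair only if P[i, j] also dominates its column.
        if int(np.argmax(p[:, j])) == i:
            matches.append((i, j))
    return matches

# Toy data: J holds the points of I shifted slightly and listed in another order.
features_i = [(0, 0), (10, 0), (0, 10), (10, 10)]
shift = np.array([1.0, 0.5])
order = [2, 0, 3, 1]
features_j = [np.array(features_i[k]) + shift for k in order]

matches = svd_correspondence(features_i, features_j, sigma=3.0)
print(matches)
```

The parameter sigma controls how far apart two features can be and still attract each other; it should be chosen relative to the expected displacement between the two images.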
Current Status
We are almost done labeling the Chinese Wikipedia to use as a training set for our maximum entropy program.
Chinese character images
Point feature extraction
(Near) Future Work
Finish and test our maximum entropy model
Point feature extraction
Ideas: Zip files, Relaxation-based pattern matching, Segmentation
Questions?
Scott, Guy L. and Longuet-Higgins, H. Christopher. (1991). An Algorithm for Associating the Features of Two Images. Proc. R. Soc. Lond. B 244, 21-26. doi: 10.1098/rspb.1991.0045
Pilu, Maurizio. (1997). Uncalibrated Stereo Correspondence by Singular Value Decomposition. HP Laboratories Bristol, Digital Media Department, HPL-97-96, August 1997.
Nagasaki, Takeshi, Yanagida, Tadashi, and Nakagawa, Masaki. Relaxation-Based Pattern Matching Using Automatic Differentiation for Off-line Character Recognition.
Borthwick, Andrew, Sterling, John, Agichtein, Eugene, and Grishman, Ralph. Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition. New York University.