Post on 26-Dec-2015
R&D Group
Development, People-Oriented, Communication, Creating Value
Liqi Gao
Text Operations
Utilities
• C/C++ library
• Perl (ActivePerl)
• Regular expressions
• EditPlus / UltraEdit
• Excel
C/C++ Language
• Standard library
  - Read a line
  - Remove a CR or LF
  - Split a line
• C++ Boost library
  - Case conversion
  - Trimming
  - Replace algorithm
  - Finding algorithm
  - Split
C/C++: Read a Line
Though it’s simple, it’s useful! Three methods:
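The three methods themselves are not preserved in this transcript. As a minimal sketch, assuming they correspond to the three usual standard-library routes (fgets, istream::getline, and std::getline), reading a line could look like this. The function names read_line_fgets, read_line_buffer, and read_line_string are illustrative, not taken from the slides.

```cpp
#include <cassert>
#include <cstdio>
#include <istream>
#include <sstream>
#include <string>

// Method 1 (C stdio): fgets keeps the trailing '\n' if one was read.
// Lines longer than the buffer come back in several pieces.
inline bool read_line_fgets(std::FILE* fp, std::string& line) {
    char buf[1024];
    if (std::fgets(buf, sizeof buf, fp) == nullptr) return false;
    line = buf;
    return true;
}

// Method 2 (iostream, fixed buffer): istream::getline discards the '\n'.
// It sets failbit if the line does not fit into the buffer.
inline bool read_line_buffer(std::istream& in, std::string& line) {
    char buf[1024];
    if (!in.getline(buf, sizeof buf)) return false;
    line = buf;
    return true;
}

// Method 3 (iostream, std::string): std::getline discards the '\n'
// and grows the string as needed, so there is no length limit to manage.
inline bool read_line_string(std::istream& in, std::string& line) {
    return static_cast<bool>(std::getline(in, line));
}
```

Note the difference in newline handling: the fgets variant returns the line with its '\n' still attached, while both getline variants strip it.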
C/C++: Remove CR/LF
Getting a line on Windows and on Linux: Windows text files end each line with CR LF, Linux text files with LF only
C/C++: Remove CR/LF (cont.)
The annoying CR (carriage return)
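When a file written on Windows is read on Linux, each line returned by std::getline still ends in '\r'. The slide's own code is not preserved in this transcript. A minimal sketch of one way to drop the stray characters, with chomp as an illustrative name borrowed from Perl:

```cpp
#include <cassert>
#include <string>

// Remove every trailing '\r' or '\n' from the line, in place.
// Characters in the middle of the string are left untouched.
inline std::string& chomp(std::string& line) {
    while (!line.empty() && (line.back() == '\n' || line.back() == '\r')) {
        line.pop_back();
    }
    return line;
}
```

Calling chomp on every line right after reading it makes the rest of the program independent of the platform that produced the file.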
C/C++: Split a Line
Split a line by a specific character
[Figure: the characters of "HELLO WORLD!" laid out one per cell, shown twice]
C/C++: Split a Line (cont.)
Split a line
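The code for this slide is not preserved in the transcript. A minimal sketch of splitting a line at a specific character with std::string::find follows. The name split_line is illustrative, and keeping empty fields between adjacent delimiters is an assumption about the desired behavior.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split `line` at every occurrence of `delim`.
// Empty fields are kept, so "a,,b" yields three fields.
inline std::vector<std::string> split_line(const std::string& line, char delim) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type pos = line.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));  // last field, may be empty
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;  // continue just past the delimiter
    }
    return fields;
}
```

For example, split_line("HELLO WORLD!", ' ') returns the two fields "HELLO" and "WORLD!".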
C++ Boost: Case Conversion
to_upper: Convert a string to upper case
to_lower: Convert a string to lower case
C++ Boost: Trimming & Replace
C++ Boost: Split
split(): splits the input into parts
Regular Expression
Regular expressions are a powerful tool for string operations.
Operator  Meaning                               Example   Matches
*         0 or more times                       be*       b, be, bee, beee, …
?         0 or 1 time                           be?       b, be
+         1 or more times                       be+       be, bee, beee, …
[]        any one of the enclosed characters    [A-Z]
[^]       any one character not enclosed        [^a-z]
()        group                                 (abc)+
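The operators in the table can be tried out with std::regex from the C++11 standard library, which uses ECMAScript syntax by default. This is a minimal sketch. The helper name matches is illustrative, and std::regex_match is used so that the whole text, not just a part of it, must fit the pattern.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Return true if the entire `text` matches `pattern`
// (ECMAScript syntax, the std::regex default).
inline bool matches(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}
```

With this helper, matches("be*", "b") and matches("be+", "bee") are true, while matches("be+", "b") and matches("be?", "bee") are false.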
An Example
Search pattern                  Replacement
*\([0-9/ ]+\) *[0-9\.\?]+%      (empty)
^( *)([0-9]+)( *)               \2\t
An Introduction to Perl
• Excels at pattern search and text manipulation (Practical Extraction and Report Language)
• Open source / free software
  - Cheap! Free and available for all systems
  - Can use and install without restriction
  - Open source promotes portability
  - Vastly expandable through freely available modules (add-on libraries at the CPAN repository)
  - Fewer restrictions / lower cost for commercial use
  - Can buy fancy development tools if desired
  - Centralized source and a linear development path avoid vendor vicissitudes and incompatibilities!
Perl is not compiled
C (compiled): source code → C compiler → binary executable
#include <stdio.h>
int main(void) {
    double x = 6e9;
    printf("Hello world!\n");
    printf("All %.0f of you!\n", x);  /* %d with a floating-point argument is undefined behavior in C */
    return 0;
}
10001110110011000111011100001110111000110111011000111000110111010100110111001011001101101101010101000111001110001101010101101010101001011101011101100011111000 ...
Perl (interpreted): source code → Perl interpreter → output
#!/usr/bin/perl
$x = 6e9;
print "Hello world!\n";
printf "All %d of you!\n", $x;
Output:
Hello world!
All 6000000000 of you!
Source code
• Plain text (ASCII)
• Human readable
• Human editable
• Platform independent
Binary executable
• NOT human readable
• NOT human editable
• NOT platform independent!
A Taste of Perl: print a message
perltaste.pl: Greet the entire world.
#!/usr/bin/perl -w
$x = 6e9;
print "Hello world!\n";
printf "All %d of you!\n", $x;
Line 1 is the command interpretation header, line 2 is a variable assignment statement, and lines 3 and 4 are function calls (output statements).
Scalar Values
Numerical values
• integer: 5, "3", 0, -307
• floating point: 6.2e9, -4022.33
• hexadecimal/octal: 0x0d4f, 0477
NOTE: all numerical values are stored as floating-point numbers (usually "double" precision)
String Values
Double-quoted: interpolates (replaces a variable name or control character with its value)
Single-quoted: no interpolation done (as-is)
Quoting operators: qq//, qw//, etc.
$day = "Monday";
Literal                 Printed as
"Happy Monday!\n"       Happy Monday!<NL>
"Happy $day!\n"         Happy Monday!<NL>
'Happy Monday!\n'       Happy Monday!\n
'Happy $day!\n'         Happy $day!\n
String Manipulation: Concatenation
$dna1 = "ACTGCGTAGC";
$dna2 = "CTTGCTAT";
Juxtapose in a string assignment or print statement:
$new_dna = "$dna1$dna2";
Use the concatenation operator '.':
$new_dna = $dna1 . $dna2;
Add segments serially using incremental concatenation:
$new_dna = $dna1;
$new_dna .= $dna2;
(shorthand for: $new_dna = $new_dna . $dna2;)
Substitution
DNA transcription: T → U
Substitution operator s///:
$dna = "GATTACATACACTGTTCA";
$rna = $dna;
$rna =~ s/T/U/g;   # "GAUUACAUACACUGUUCA" (the g modifier replaces every T, not only the first)
Exercise: Start with $dna = "gattACataCACTgttca"; and do the same as above. Print $rna to the screen.
transcribe.pl:
$dna = "gattACataCACTgttca";
$rna = $dna;
$rna =~ s/T/U/g;
print "DNA: $dna\n";
print "RNA: $rna\n";
Does it do what you expect? If not, why not?
Patterns in substitution are case-sensitive, so the lowercase t's are left unchanged. What can we do?
• Convert all letters to upper (or lower) case (preferred when possible)
• If we want to retain mixed case, use the transliteration operator tr///
$rna =~ tr/tT/uU/;
Case conversion
$string = "acCGtGcaTGc";
Upper case:
$dna = uc($string);   # "ACCGTGCATGC"
or $dna = uc $string;
or $dna = "\U$string";
Lower case:
$dna = lc($string);   # "accgtgcatgc"
or $dna = "\L$string";
Sentence case:
$dna = ucfirst(lc($string));   # "Accgtgcatgc" (ucfirst alone leaves the remaining letters unchanged)
or $dna = "\u\L$string";
Perl in NLP
• Dictionary lookup
• Word frequency
• Chinese word segmentation
• POS (part-of-speech) tagging
• …
• Whatever you could need
Case study
Thanks for your attention