The CLARIN-NL & CLARIN-VL Web Services Project
Marc Kemps-Snijders Meertens Institute
Ineke Schuurman K.U. Leuven
Freudenstadt 2010-11-16
CLARIN Mission
Mission: create an infrastructure which makes language resources (annotated recordings, texts, lexica, ontologies) and technology (speech recognizers, lemmatizers, parsers, summarizers, information extractors) available and readily usable to scholars of all disciplines, in particular the Humanities and Social Sciences.
Web service workshop Freudenstadt
2010-11-16
www.clarin.eu
Strategy
HOW ARE WE GOING TO DO THAT?
Building an infrastructure via CLARIN centers
Demonstration projects
Technical and financial support
Include as many institutions as possible in the Netherlands and Flanders
Participants Clarin-NL
Utrecht Institute for Linguistics OTS
Landelijke Onderzoeksschool Taalkunde
Max-Planck-Institute for Psycholinguistics
Meertens Instituut (KNAW)
Huygens Instituut (KNAW)
Data Archiving and Networked Services (KNAW)
Fryske Akademy (KNAW)
Digitale Bibliotheek voor de Nederlandse Letteren
Instituut voor Nederlandse Lexicologie
Centre for Language and Speech Technology
Centre for Language Studies
Amsterdam Center for Language and Communication
Center for Language and Cognition
Centre for Linguistics
Tilburg Centre for Creative Computing
Human Media Interaction Group
Katholiek Documentatie Centrum
Koninklijke Bibliotheek
Veteraneninstituut
Taal en Communicatie Vrije Universiteit
Instituut voor Beeld en Geluid
Nederlands Instituut voor Oorlogsdocumentatie
Aletta (Instituut voor Vrouwengeschiedenis)
Participants Clarin-VL
Centrum voor Computerlinguïstiek (CCL), K.U.Leuven
Interdisciplinary research on Technology, Education & Communication (itec), K.U.Leuven (Kortrijk)
Center for Computational Linguistics and Psycholinguistics (CLiPS), Universiteit Antwerpen
Elektronica en Informatiesystemen (ELIS-DSSP), Universiteit Gent
ESAT-PSI Spraak, K.U.Leuven
Laboratory for Digital Speech and Audio Processing (ETRO-DSSP), Vrije Universiteit Brussel
Language Intelligence and Information Retrieval (HMDB-LIIR), K.U.Leuven
Language and Translation Technology Team (LT³), Hogeschool Gent
CLARIN centers - Netherlands
Max Planck Institute
Meertens Institute
INL
DANS
Huygens Institute
CLARIN centers provide (CLARIN compatible) data and software that can be used by all researchers of participating institutions.
CLARIN centers – Flanders (in preparation)
University of Antwerp
Catholic University of Leuven
CLARIN centers provide (CLARIN compatible) data and software that can be used by all researchers of participating institutions.
Clarin-NL Call 1 Data Curation & Demonstrator Projects
Involve a targeted user and address the user’s research questions
Open call for small subprojects (0.5 yr / max. 60k Euro each)
12 projects running, some already finished
Will make available
- a range of curated resources
- a range of showcases of CLARIN functionality
- evidence-based requirements and desiderata for the CLARIN infrastructure and for supported standards and best practices
2009 Name Description
AAM-LR Automatic Annotation of Multi-modal Language Resources
Adelheid A Distributed Lemmatizer for Historical Dutch
ADEPT Assaying Differences via Edit-Distance of Pronunciation Transcriptions
DUELME-LMF Converting DUELME into LMF format
INTER-VIEWs Curation of Interview Data
MIMORE Microcomparative Morphosyntax Research Tool
SignLinC Linking lexical databases and annotated corpora of signed languages
TICCLops Text-Induced Corpus Clean-up online processing system
TDS Curator A web-services architecture to curate the Typological Database System
TQE Transcription Quality Evaluation
WFT-GTB Integrating the Wurdboek fan 'e Fryske Taal into the Geïntegreerde Taalbank
Clarin-NL Call 2 Data Curation & Demonstrator Projects
New in 2010
Second Open Call for data curation & demonstrator projects
Directly assigned projects
Selection to be based (inter alia) on results of user survey
Budget available: 400k Euro
Initial selection before end 2010
Clarin-NL Infrastructure projects
Infrastructure Implementation Project (IIP, 3 yrs)
- infrastructure services, an open archiving service, registries, federation of centers
- set up a schema registry, profile matching, ISOCAT maintenance, add relation registry RELCAT
- coordinate and give guidance for work on web services, wrapper and service bus specification and implementation
- select workflow tools and experiment with them
Clarin-NL Infrastructure projects
Metadata project (MD, 0.5 yr)
- Testing CMDI against existing national data
- Create initial set of required metadata components
- Results (see http://www.clarin.eu/cmdi): Component Registry, CMDI XML Toolkit, documentation for users and developers, Metadata Tutorial held
Search & Develop (S&D, 3 yrs)
- centralized metadata search
- distributed content search: text based and structured search
- National extension of the European Demonstrator project
Clarin-VL and Clarin-NL
TTNWW (3 yrs)
Cooperation between Netherlands and Flanders
Implement user friendly workflow services for
- Text: enriching text corpora with annotations. For literature researchers (Huygens) and archeologists (Sagalassos)
- Speech: indexing and search of (a limited set of) audio and video data. For social historians (Aletta, KDC, KADOC, M2P)
TTNWW web services
TEXT
Text preprocessing - Tilburg University
Lexical Unit tagging - Tilburg University
Shallow parsing - University of Antwerp/Tilburg University
NER and coreference - University College Ghent
Semantic role annotations - Utrecht University
Spatiotemporal analysis - Catholic University of Leuven
SPEECH
Speech recognizers - Catholic University of Leuven/University of Twente
Segmentation, transcription, alignment - Ghent University/Catholic University of Leuven/University of Twente/Radboud University
TTNWW workflow (simplified)
[Workflow diagram with two parts, TEXT and SPEECH; labels as they appear on the slide]
TEXT
Input: machine readable texts
Text preprocessing
Lexical Unit Tagging
Full Parsing / Shallow Parsing
Alignment
NER and Coreference
Spatiotemporal analysis
Output: analyzed text
SPEECH
Input: audio (archives, museums, interviews, …)
WEB input: context relevant texts, previous recordings, ..
Resource adaptation
ASR resources: lexicon, acoustic model, language model
ASR (Automatic Speech Recognition)
Output: transcription with recognition errors and incomplete punctuation
Path strongly depends on quality of transcriptions
Questions
How to make these services available?
How to construct workflows from these services?
How to accommodate different data formats?
- Text (D-COI/SONAR is de facto standard in NL/VL)
- Speech + text
How to accommodate other CLARIN requirements?
- Metadata for web services
- Metadata for workflow specifications
- Metadata and provenance data generation
- Authorization and authentication
Making services available
CMD Service specification
In Clarin each service must be described using CMDI
Web service metadata is harvested to central MD repository using standard OAI-PMH protocol
[Annotated view of a service description; callouts:]
REST, SOAP or XML/RPC
Reference to WSDL or WADL file
Reference to format specification, e.g. schema file
Alternative format name, e.g. TCF
Reference to format element, e.g. schema element
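The harvesting step mentioned on this slide can be sketched roughly as follows. This is a minimal illustration of OAI-PMH `ListRecords` paging, not the CLARIN harvester: the base URL `http://example.org/oai` and the metadata prefix `cmdi` are assumptions made for the example, and no network request is sent.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"  # OAI-PMH 2.0 response namespace


def list_records_url(base_url, metadata_prefix="cmdi", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    metadata_prefix="cmdi" is an assumed value; a real repository
    advertises its prefixes through the ListMetadataFormats verb.
    """
    if resumption_token:
        # resumptionToken is an exclusive argument: no metadataPrefix alongside it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumptionToken of a ListRecords response, or None when done."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{{{OAI_NS}}}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


# Hand-written response fragment, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://example.org/oai"))
print(list_records_url("http://example.org/oai", resumption_token=next_token(SAMPLE)))
```

A harvester would repeat the second call until `next_token` returns `None`.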
CMD Service specification
Each service metadata document must contain a Service component and may be further enhanced with descriptive information such as organization information
The Service component describes the operations, input and output parameters and characteristics of the resource being used in the operation
Two types of parameters are distinguished:
- ConfigurationParameter: describes configuration settings
- ResourceParameter: describes resource characteristics expressed in the TechnicalMetadata component
CMD Service specification
TechnicalMetadata contains information on
- General technical characteristics of a resource, e.g. mime type, character encoding, ..
- is extensible to describe different resource types
- Content related information: ContentEncoding
  - Structure indication, e.g. schema reference for XML
  - Structural element presence (which structural elements are actually present in the resource): AnnotationLevel
TechnicalMetadata is expected to be present in resource metadata to enable profile matching
AnnotationLevel contains additional information relevant to the structural element
- E.g. PartOfSpeech may contain additional information on tagset and tagset language
How to create service CMDI documents
Generate skeleton CMDI document from WSDL or WADL
- Example: Language Identifier service from RACAI using WSDL
CMDI metadata document editing
CMDI and WSDL
Example: Language Identifier service from RACAI using WSDL
<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
    xmlns:tm="http://microsoft.com/wsdl/mime/textMatching/"
    xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
    xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/"
    xmlns:tns="http://tempuri.org/"
    xmlns:s="http://www.w3.org/2001/XMLSchema"
    xmlns:soap12="http://schemas.xmlsoap.org/wsdl/soap12/"
    xmlns:http="http://schemas.xmlsoap.org/wsdl/http/"
    targetNamespace="http://tempuri.org/"
    xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
  <wsdl:types>
    <s:schema elementFormDefault="qualified" targetNamespace="http://tempuri.org/">
      ………………………………………………………………
    </s:schema>
  </wsdl:types>
  <wsdl:message name="IdentifyLanguageSoapIn">
    <wsdl:part name="parameters" element="tns:IdentifyLanguage"/>
  </wsdl:message>
  <wsdl:message name="IdentifyLanguageSoapOut">
    <wsdl:part name="parameters" element="tns:IdentifyLanguageResponse"/>
  </wsdl:message>
  <wsdl:portType name="LangIdWebServiceSoap">
    <wsdl:operation name="IdentifyLanguage">
      <wsdl:input message="tns:IdentifyLanguageSoapIn"/>
      <wsdl:output message="tns:IdentifyLanguageSoapOut"/>
    </wsdl:operation>
  </wsdl:portType>
  <wsdl:binding name="LangIdWebServiceSoap" type="tns:LangIdWebServiceSoap">
    <soap:binding transport="http://schemas.xmlsoap.org/soap/http"/>
    <wsdl:operation name="IdentifyLanguage">
      <soap:operation soapAction="http://tempuri.org/IdentifyLanguage" style="document"/>
      <wsdl:input>
        <soap:body use="literal"/>
      </wsdl:input>
      <wsdl:output>
        <soap:body use="literal"/>
      </wsdl:output>
    </wsdl:operation>
  </wsdl:binding>
  <wsdl:service name="LangIdWebService">
    <wsdl:port name="LangIdWebServiceSoap" binding="tns:LangIdWebServiceSoap">
      <soap:address location="http://www.racai.ro/WebServices/LangId.asmx"/>
    </wsdl:port>
  </wsdl:service>
</wsdl:definitions>
[Callouts on the WSDL:]
Service Operation
Input message
Output message
Service name
Service location
WSDL and message schema
Example: Language Identifier service from RACAI using WSDL
<wsdl:definitions xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
    xmlns:tm="http://microsoft.com/wsdl/mime/textMatching/"
    xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
    xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/"
    xmlns:tns="http://tempuri.org/"
    xmlns:s="http://www.w3.org/2001/XMLSchema"
    xmlns:soap12="http://schemas.xmlsoap.org/wsdl/soap12/"
    xmlns:http="http://schemas.xmlsoap.org/wsdl/http/"
    targetNamespace="http://tempuri.org/"
    xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
  <wsdl:types>
    <s:schema elementFormDefault="qualified" targetNamespace="http://tempuri.org/">
      <s:element name="IdentifyLanguage">
        <s:complexType>
          <s:sequence>
            <s:element minOccurs="0" maxOccurs="1" name="text" type="s:string"/>
            <s:element minOccurs="1" maxOccurs="1" name="modern_languages" type="s:boolean"/>
            <s:element minOccurs="1" maxOccurs="1" name="rare_languages" type="s:boolean"/>
          </s:sequence>
        </s:complexType>
      </s:element>
      <s:element name="IdentifyLanguageResponse">
        <s:complexType>
          <s:sequence>
            <s:element minOccurs="0" maxOccurs="1" name="IdentifyLanguageResult" type="tns:LangIDResult"/>
          </s:sequence>
        </s:complexType>
      </s:element>
      <s:complexType name="LangIDResult">
        <s:sequence>
          <s:element minOccurs="0" maxOccurs="1" name="Language" type="s:string"/>
          <s:element minOccurs="1" maxOccurs="1" name="Confidence" type="s:double"/>
        </s:sequence>
      </s:complexType>
    </s:schema>
  </wsdl:types>
Input message
Message element: text (Resource)
Message element: modern_languages (Configuration)
Message element: rare_languages (Configuration)
Output message
Message element: Language
Message element: Confidence
Example input/output message
Example: Language Identifier service from RACAI using WSDL
Input message
<IdentifyLanguage>
  <text>The text for which the language to be identified goes here...</text>
  <modern_languages>true</modern_languages>
  <rare_languages>false</rare_languages>
</IdentifyLanguage>
Message element: text (Resource)
Message element: modern_languages (Configuration)
Message element: rare_languages (Configuration)
Output message
<LangIDResult>
  <Language>english</Language>
  <Confidence>94.8</Confidence>
</LangIDResult>
Message element: Language
Message element: Confidence
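As an illustration of how a client might produce and read such messages, here is a minimal sketch using only the Python standard library. It wraps the input message in a SOAP 1.1 envelope and reads the two output elements. The function names are hypothetical, the envelope layout is an assumption based on the document/literal binding declared in the WSDL, and nothing is sent over the network.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
TNS = "http://tempuri.org/"                             # targetNamespace from the WSDL


def build_identify_language_request(text, modern_languages=True, rare_languages=False):
    """Return a SOAP envelope (as a string) carrying an IdentifyLanguage message."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{TNS}}}IdentifyLanguage")
    ET.SubElement(call, f"{{{TNS}}}text").text = text                 # resource parameter
    # xs:boolean values are written as lowercase "true"/"false"
    ET.SubElement(call, f"{{{TNS}}}modern_languages").text = str(bool(modern_languages)).lower()
    ET.SubElement(call, f"{{{TNS}}}rare_languages").text = str(bool(rare_languages)).lower()
    return ET.tostring(envelope, encoding="unicode")


def parse_lang_id_result(xml_text):
    """Extract (language, confidence) from a result, ignoring namespaces."""
    root = ET.fromstring(xml_text)
    values = {el.tag.rsplit("}", 1)[-1]: el.text for el in root.iter()}
    return values["Language"], float(values["Confidence"])


request = build_identify_language_request("Dit is een korte test.")
language, confidence = parse_lang_id_result(
    "<LangIDResult><Language>english</Language><Confidence>94.8</Confidence></LangIDResult>"
)
print(request)
print(language, confidence)
```

When posting such a request, the `SOAPAction` HTTP header would carry the value given in the WSDL binding (`http://tempuri.org/IdentifyLanguage`).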
CMDI skeleton generated from WSDL
<?xml version="1.0" encoding="UTF-8"?>
<CMD xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="example-md-schema.xsd">
  <Header/>
  <Resources>
    <JournalFileProxyList/>
    <ResourceProxyList>
      <ResourceProxy>
        <ResourceType>WSDL service</ResourceType>
        <ResourceRef>http://www.racai.ro/webservices/LangId.asmx</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
  <Components>
    <Service>
      <Type>SOAP</Type>
      <Name>LangIdWebService</Name>
      <URL>http://www.racai.ro/webservices/LangId.asmx?WSDL</URL>
      <Operation>
        <Name>IdentifyLanguage</Name>
        <Action>http://tempuri.org/IdentifyLanguage</Action>
        <Input>
          <Parameter>
            <Name>IdentifyLanguage.text</Name>
          </Parameter>
          <Parameter>
            <Name>IdentifyLanguage.modern_languages</Name>
          </Parameter>
          <Parameter>
            <Name>IdentifyLanguage.rare_languages</Name>
          </Parameter>
        </Input>
        <Output>
          <Parameter>
            <Name>IdentifyLanguageResponse.Language</Name>
          </Parameter>
          <Parameter>
            <Name>IdentifyLanguageResponse.Confidence</Name>
          </Parameter>
        </Output>
      </Operation>
    </Service>
  </Components>
</CMD>
[Callouts on the skeleton:]
Input ‘parameter’
Output ‘parameter’
WSDL Location
Service Location (this should be a PID)
Operation Name
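One way such a skeleton could be derived from a WSDL is sketched below. This is not the generator used in the project: it is a simplified illustration that omits the Resources section, handles only the document/literal pattern of this example, and expands a named complex type one level deep. The embedded WSDL is a trimmed copy of the RACAI example; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/",
      "soap": "http://schemas.xmlsoap.org/wsdl/soap/",
      "s": "http://www.w3.org/2001/XMLSchema"}
SCHEMA = "wsdl:types/s:schema"

def local(qname):
    """Drop a namespace prefix such as 'tns:' from a QName."""
    return qname.split(":", 1)[-1]

def parameter_names(root, message_name):
    """Return 'Element.child' names for the schema element behind a message."""
    part = root.find(f"wsdl:message[@name='{message_name}']/wsdl:part", NS)
    element = local(part.get("element"))
    names = []
    for child in root.findall(
            f"{SCHEMA}/s:element[@name='{element}']/s:complexType/s:sequence/s:element", NS):
        type_name = local(child.get("type", ""))
        # Expand a named complex type (one level); otherwise keep the child itself.
        nested = root.findall(
            f"{SCHEMA}/s:complexType[@name='{type_name}']/s:sequence/s:element", NS)
        names += [n.get("name") for n in nested] or [child.get("name")]
    return [f"{element}.{name}" for name in names]

def cmdi_skeleton(wsdl_text, wsdl_url):
    """Build a CMD element with a Service component from WSDL text."""
    root = ET.fromstring(wsdl_text)
    cmd = ET.Element("CMD")
    ET.SubElement(cmd, "Header")
    service = ET.SubElement(ET.SubElement(cmd, "Components"), "Service")
    ET.SubElement(service, "Type").text = "SOAP"
    ET.SubElement(service, "Name").text = root.find("wsdl:service", NS).get("name")
    ET.SubElement(service, "URL").text = wsdl_url
    for op in root.findall("wsdl:portType/wsdl:operation", NS):
        operation = ET.SubElement(service, "Operation")
        ET.SubElement(operation, "Name").text = op.get("name")
        soap_op = root.find(
            f"wsdl:binding/wsdl:operation[@name='{op.get('name')}']/soap:operation", NS)
        if soap_op is not None:
            ET.SubElement(operation, "Action").text = soap_op.get("soapAction")
        for direction, tag in (("input", "Input"), ("output", "Output")):
            holder = ET.SubElement(operation, tag)
            message = local(op.find(f"wsdl:{direction}", NS).get("message"))
            for name in parameter_names(root, message):
                ET.SubElement(ET.SubElement(holder, "Parameter"), "Name").text = name
    return cmd

WSDL = """<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
    xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
    xmlns:s="http://www.w3.org/2001/XMLSchema" xmlns:tns="http://tempuri.org/">
  <wsdl:types><s:schema targetNamespace="http://tempuri.org/">
    <s:element name="IdentifyLanguage"><s:complexType><s:sequence>
      <s:element name="text" type="s:string"/>
      <s:element name="modern_languages" type="s:boolean"/>
      <s:element name="rare_languages" type="s:boolean"/>
    </s:sequence></s:complexType></s:element>
    <s:element name="IdentifyLanguageResponse"><s:complexType><s:sequence>
      <s:element name="IdentifyLanguageResult" type="tns:LangIDResult"/>
    </s:sequence></s:complexType></s:element>
    <s:complexType name="LangIDResult"><s:sequence>
      <s:element name="Language" type="s:string"/>
      <s:element name="Confidence" type="s:double"/>
    </s:sequence></s:complexType>
  </s:schema></wsdl:types>
  <wsdl:message name="IdentifyLanguageSoapIn">
    <wsdl:part name="parameters" element="tns:IdentifyLanguage"/>
  </wsdl:message>
  <wsdl:message name="IdentifyLanguageSoapOut">
    <wsdl:part name="parameters" element="tns:IdentifyLanguageResponse"/>
  </wsdl:message>
  <wsdl:portType name="LangIdWebServiceSoap">
    <wsdl:operation name="IdentifyLanguage">
      <wsdl:input message="tns:IdentifyLanguageSoapIn"/>
      <wsdl:output message="tns:IdentifyLanguageSoapOut"/>
    </wsdl:operation>
  </wsdl:portType>
  <wsdl:binding name="LangIdWebServiceSoap" type="tns:LangIdWebServiceSoap">
    <wsdl:operation name="IdentifyLanguage">
      <soap:operation soapAction="http://tempuri.org/IdentifyLanguage"/>
    </wsdl:operation>
  </wsdl:binding>
  <wsdl:service name="LangIdWebService"/>
</wsdl:definitions>"""

skeleton = cmdi_skeleton(WSDL, "http://www.racai.ro/webservices/LangId.asmx?WSDL")
print(ET.tostring(skeleton, encoding="unicode"))
```

The WSDL carries no information about mime type or encoding of the `text` parameter, so those details still have to be added by hand or by a metadata editor.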
Parameter requirements
Resource
Input message
<text>The text for which the language is to be identified goes here...</text>
Message element: text (Resource)
<text> requirements:
- Text/plain
- UTF-8 encoded
These are not represented in WSDL
These characteristics are important for profile matching
Current state of discussion:
- Specify using CMDI
- Specify using external type system
CMD for web services (refresh)
How to specify TechnicalMetadata?
Describing XML resource with POS token element characteristics
<TechnicalMetadata>
  <MimeType>text/xml</MimeType>
  <CharacterEncoding>UTF-8</CharacterEncoding>
  <ContentEncoding>
    <URL>hdl:tcf_reference</URL>
    <ResourceFormat>Text Corpus Format</ResourceFormat>
    <AnnotationLevel>
      <Identifier>TextCorpus.POSTags.tag.token</Identifier>
      <TagSet>STTS</TagSet>
      <Language>de</Language>
    </AnnotationLevel>
  </ContentEncoding>
</TechnicalMetadata>
Describing plain text resource
<TechnicalMetadata>
  <MimeType>text/plain</MimeType>
  <CharacterEncoding>UTF-8</CharacterEncoding>
  <ContentEncoding>
    <URL>hdl:plain_text_reference</URL>
    <ResourceFormat>Plain Text</ResourceFormat>
  </ContentEncoding>
</TechnicalMetadata>
Where to get the right component?
e.g. Components are associated with nodes in ontology of resources
See e.g. MyGrid service ontology
Ontology nodes:
TextCorpus.text …
TextCorpus.POStags.tag …
TextCorpus.sem_lex_rels.antonymy.orthform
hdl:tcf_reference
<CMD_Component name="Text Corpus Format">
  <CMD_Element name="MimeType" ValueScheme="string"/>
  <CMD_Element name="CharacterEncoding" ValueScheme="string"/>
</CMD_Component>
MimeType = text/xml
CharacterEncoding = ?
hdl:plain_text_reference
<CMD_Component name="Plain text">
  <CMD_Element name="MimeType" ValueScheme="string"/>
  <CMD_Element name="CharacterEncoding" ValueScheme="string"/>
</CMD_Component>
MimeType = text/plain
CharacterEncoding = ?
<CMD_Component name="TCF-POS">
  <CMD_Component name="TagSet">
    <CMD_Element name="Name" ValueScheme="string"/>
    <CMD_Element name="Language" ValueScheme="string" ConceptLink="http://www.isocat.org/datcat/DC-1766"/>
  </CMD_Component>
</CMD_Component>
Tagset = STTS
Language = de
Smart web service metadata editor picks up Components and guides metadata creation process
How to put this to use
Profile matching
- Each resource must specify TechnicalMetadata component
- Match is made against TechnicalMetadata of service input parameters
Metadata and provenance data generation
- The resulting resource of each service invocation must have metadata and provenance data
- Options: Service wrapper, Enterprise Service Bus
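Profile matching as described here could look roughly like the sketch below. The matching rule is an assumption made for illustration (equal mime type, encoding and format reference, and the resource must offer every annotation level the service input requires); the slides do not define the rule, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TechnicalMetadata:
    """Subset of the TechnicalMetadata component used for matching."""
    mime_type: str
    character_encoding: str
    format_ref: str                              # e.g. "hdl:tcf_reference"
    annotation_levels: frozenset = frozenset()   # e.g. {"TextCorpus.POSTags.tag.token"}


def matches(resource: TechnicalMetadata, required: TechnicalMetadata) -> bool:
    """Assumed rule: same mime type, encoding and format; the resource
    must contain at least the annotation levels the service asks for."""
    return (
        resource.mime_type.lower() == required.mime_type.lower()
        and resource.character_encoding.lower() == required.character_encoding.lower()
        and resource.format_ref == required.format_ref
        and required.annotation_levels <= resource.annotation_levels
    )


# Resources, described as in the TechnicalMetadata examples of this talk.
plain_text = TechnicalMetadata("text/plain", "UTF-8", "hdl:plain_text_reference")
tcf_with_pos = TechnicalMetadata(
    "text/xml", "UTF-8", "hdl:tcf_reference",
    frozenset({"TextCorpus.POSTags.tag.token"}),
)

# Input requirement of the language identifier's <text> parameter.
lang_id_input = TechnicalMetadata("text/plain", "utf-8", "hdl:plain_text_reference")

print(matches(plain_text, lang_id_input))    # plain text fits the service input
print(matches(tcf_with_pos, lang_id_input))  # TCF resource does not
```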
Clarin Service Bus Principle
[Diagram: Client, CSB, Web service]
CMDI (incoming):
- Resolve Service Identifier
- Resolve parameter resource identifiers
- Retain incoming metadata documents
CMDI (outgoing): Create metadata documents
- inline: reuse metadata of incoming resource
- standoff: create new metadata document
Provenance: Create provenance data
- Service
- Operation
- Input parameter values
- Output parameter values
Client calls service and receives result
Clarin Service BUS design (Component diagram)
Implemented using Apache ServiceMix 4
Clarin Service Bus Interface
CSB publishes generic web service interface with one method: Invoke
- Invokes service on behalf of the client
- Input: Service identifier, Operation, Service parameters
- Output: Service response
(Injected) WSDL for each service is published instead of original WSDL or WADL
Service identifier: PID to service metadata document
Operation: SOAPAction from WSDL document
Service parameters: Original. Resource is injected here using the resource metadata PID
Provenance data
Example
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Entries>
  <Entry date="Thu Nov 11 09:14:01 CET 2010">
    <Input>
      <To>hdl:serviceMetadata</To>
      <Action>http://tempuri.org/IdentifyLanguage</Action>
      <serviceParameters ……..>
        <q0:IdentifyLanguage>
          <q0:text>hdl:resourceMetadata</q0:text>
          <q0:modern_languages>true</q0:modern_languages>
          <q0:rare_languages>false</q0:rare_languages>
        </q0:IdentifyLanguage>
      </serviceParameters>
    </Input>
    <Output>
      <IdentifyLanguageResponse xmlns="http://tempuri.org/">
        <IdentifyLanguageResult>
          <Language>hdl:1289463241309</Language>
          <LanguageEn>English</LanguageEn>
          <LanguageNative>English</LanguageNative>
          <Confidence>75.237260220415891</Confidence>
        </IdentifyLanguageResult>
      </IdentifyLanguageResponse>
    </Output>
  </Entry>
</Entries>
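A generic Invoke that records one provenance entry per call could be sketched as follows. This is an illustration only, not the ServiceMix-based implementation: the `call` argument stands in for whatever actually reaches the target service, the flat parameter and response dictionaries are a simplification of the nested messages above, and the ISO 8601 timestamp is an assumption (the example entry uses a different date format).

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def provenance_entry(service_pid, action, parameters, response, when=None):
    """Build an <Entry> element recording one service invocation."""
    when = when or datetime.now(timezone.utc)
    entry = ET.Element("Entry", date=when.isoformat())  # assumed timestamp format
    inp = ET.SubElement(entry, "Input")
    ET.SubElement(inp, "To").text = service_pid          # PID of the service metadata
    ET.SubElement(inp, "Action").text = action           # SOAPAction from the WSDL
    params = ET.SubElement(inp, "serviceParameters")
    for name, value in parameters.items():
        ET.SubElement(params, name).text = str(value)
    out = ET.SubElement(entry, "Output")
    for name, value in response.items():
        ET.SubElement(out, name).text = str(value)
    return entry


def invoke(service_pid, action, parameters, call, log):
    """Call the service on behalf of the client and append provenance to `log`.

    `call` is a hypothetical callable (service_pid, action, parameters) -> dict.
    """
    response = call(service_pid, action, parameters)
    log.append(provenance_entry(service_pid, action, parameters, response))
    return response


def fake_language_identifier(service_pid, action, parameters):
    """Stand-in for the real web service; returns fixed illustrative values."""
    return {"Language": "hdl:1289463241309", "Confidence": "75.2"}


entries = ET.Element("Entries")
result = invoke(
    "hdl:serviceMetadata",
    "http://tempuri.org/IdentifyLanguage",
    {"text": "hdl:resourceMetadata", "modern_languages": "true", "rare_languages": "false"},
    fake_language_identifier,
    entries,
)
print(ET.tostring(entries, encoding="unicode"))
```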
Further actions in this area
Authentication is still an issue
- Talks with GEANT: beginning of 2011 a solution may become available based on STS.
- Use brokered authentication with a security token issued by a Security Token Service (STS). The STS is trusted by both the client and the Web service to provide interoperable security tokens.
[Diagram: Client, CSB, Web service, GEANT AAI, STS (Security Token Service)]
Token transformation
- SAML
- OAUTH
- SLCS
Workspaces are still being discussed
- Requires authentication
- In the Netherlands: work together with projects such as BigGrid and CATCHPlus
SOA and workflow
[Diagram: resource, tool, resource+, tool, resource++, etc.; manual selection, adaptation etc.]
TTNWW Workflow engine selection
Selection criteria
- Data flow rather than control flow
- Possibility for nested workflow: created workflows act as building blocks to other end users
- Support for both REST and SOAP
- Capable of handling heterogeneous data formats: Text (D-COI/SONAR), Speech data (audio + transcription)
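The nesting criterion can be made concrete with a toy sketch: a workflow built from steps is itself a step, so it can be reused as a building block in a larger workflow. This only illustrates the idea and says nothing about how any particular workflow engine works; the step functions are invented for the example and the composition is strictly linear.

```python
from typing import Callable, Dict

Data = Dict[str, object]
Step = Callable[[Data], Data]


def workflow(*steps: Step) -> Step:
    """Compose steps into a workflow that has the same shape as a single step."""
    def run(data: Data) -> Data:
        for step in steps:
            # Each step reads the data it needs and adds its own outputs.
            data = {**data, **step(data)}
        return data
    return run


# Toy steps standing in for real services.
def tokenize(data: Data) -> Data:
    return {"tokens": str(data["text"]).split()}


def count_tokens(data: Data) -> Data:
    return {"n_tokens": len(data["tokens"])}


def uppercase(data: Data) -> Data:
    return {"upper": [t.upper() for t in data["tokens"]]}


preprocess = workflow(tokenize, count_tokens)   # a small workflow
full = workflow(preprocess, uppercase)          # reused as a building block

result = full({"text": "een kleine test"})
print(result)
```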
Taverna workbench workflow design
Taverna workbench workflow execution and progress
Execution scenarios
[Diagram labels:]
Local execution
1 a. Access workflow specification
1. Download workflow specification
2. Execute locally
Remote execution
1 b. Access workflow specification
2. Execute remotely
User interface (Initial design)
CMDI for workflow specifications
Predefined workflows are considered as special resources (see e.g. MyExperiment.org)
- with TechnicalMetadata specifying workflow language etc.
- with Input and Output parameter characteristics
Workflow engine (as a service)
- Defines which workflows it is capable of processing
- Optionally defines extra parameters for processing
Tying it all together
Metadata for web services
Metadata for workflow specifications
Metadata services
- Clarin metadata repository (harvesting using OAI-PMH)
- Becomes available towards end 2010
Workflow system
Metadata and Provenance data generation
- provided that AAI and workspaces are resolved.
Thank you for your attention
CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n° 212230