metadata extraction and content transformation

Post on 28-Nov-2014

Category:

Technology

DESCRIPTION

In this session, we will look first at the rich metadata that documents in your repository have, how to control the mapping of this on to your content model, and some of the interesting things this can deliver. We'll then move on to the content transformation and rendition services, and see how you can easily and powerfully generate a wide range of media from the content you already have.

TRANSCRIPT

1

Metadata Extraction and Content Transformations

Nick Burch

Software Engineer, Alfresco

twitter: @gagravarr

2

Introduction – 3 Content Related Services

Covering

• Uses
• Interfaces
• Calling the Services
• Java & JavaScript APIs
• Demos
• Extensions
• Apache Tika

• Metadata Extractor

• Content Transformer

• Renditions

3

The Metadata Extractor Service

What, How, Why?

• For a given piece of content, returns the metadata held within it
• Document metadata is converted into the content model
• Typically used with uploaded binary files
• Example: upload a PDF, extract the Title and Description, save these as properties on the Alfresco Node
• Powered internally by a number of different extractors
• Service picks the appropriate extractor for you
• Since Alfresco 3.4, makes heavy use of Apache Tika

4

The Content Transformation Service

What, How, Why?

• Transforms content from one format to another
• Driven by mime types, source and destination
• Used to generate plain text versions for indexing
• Used to generate SWF versions for preview
• Used to generate PDF versions for web download
• Powered by a large number of different transformers
• Transformers can be linked together, eg .doc -> .pdf via Open Office, then .pdf -> .swf via pdf2swf
• Since Alfresco 3.4, makes heavy use of Apache Tika
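The chaining idea (.doc -> .pdf -> .swf) can be sketched as a shortest-path search over registered source/target mime type pairs. This is a minimal illustration only: the class `TransformerChainSketch` and its methods are hypothetical and are not Alfresco's transformer registry, and the transformer names are just labels.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of transformer chaining; not the Alfresco API. */
class TransformerChainSketch {
    // source mime type -> (target mime type -> transformer name)
    private final Map<String, Map<String, String>> direct = new HashMap<>();

    void register(String source, String target, String name) {
        direct.computeIfAbsent(source, k -> new LinkedHashMap<>()).put(target, name);
    }

    /**
     * Breadth-first search for the shortest chain of transformer names.
     * Returns an empty list when no chain exists (or source equals target).
     */
    List<String> findChain(String source, String target) {
        Map<String, String> cameFrom = new HashMap<>(); // mime type -> previous mime type
        Deque<String> queue = new ArrayDeque<>();
        cameFrom.put(source, null);
        queue.add(source);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) {
                break;
            }
            for (String next : direct.getOrDefault(current, Collections.emptyMap()).keySet()) {
                if (!cameFrom.containsKey(next)) {
                    cameFrom.put(next, current);
                    queue.add(next);
                }
            }
        }
        List<String> chain = new ArrayList<>();
        if (!cameFrom.containsKey(target)) {
            return chain;
        }
        // Walk back from the target, prepending each transformer used
        for (String at = target; cameFrom.get(at) != null; at = cameFrom.get(at)) {
            chain.add(0, direct.get(cameFrom.get(at)).get(at));
        }
        return chain;
    }

    public static void main(String[] args) {
        TransformerChainSketch registry = new TransformerChainSketch();
        registry.register("application/msword", "application/pdf", "OpenOffice");
        registry.register("application/pdf", "application/x-shockwave-flash", "pdf2swf");
        System.out.println(registry.findChain("application/msword", "application/x-shockwave-flash"));
    }
}
```

The real service also weighs things like transformer availability and options; the sketch only shows why registering two direct transformers is enough to get a .doc to .swf path.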

5

The Rendition Service

What, How, Why?

• Can turn content from one kind to another
• Or can just alter some content as-is
• Used to manipulate images, eg crop and resize
• Used to generate HTML .docx previews in Web Quick Start
• Often uses the Content Transformation Service
• Replaced the Thumbnail Service
• Renditions are actions

6

Apache Tika

Apache Tika – http://tika.apache.org/

• Apache Project which started in 2006
• Grew out of the Lucene community, now widely used
• Provides detection of files – eg this binary blob is really a Word file
• Plain text, HTML and XHTML versions of a wide range of different file formats
• Consistent Metadata from different files
• Tika hides the complexity of the different formats, and presents a simple, powerful API
• Easy to use and extend

7

Metadata Extractor Service

8

Alfresco 3.3 - Supported Formats

File Formats supported out of the box

• PDF
• Word, PowerPoint, Excel
• HTML
• Open Document Formats (OpenOffice)
• RFC822 Email
• Outlook .msg Email

9

Alfresco 3.4 - Supported Formats – Page 1

File Formats supported out of the box, Page 1

• Audio – WAV, RIFF, MIDI
• DWG (CAD)
• Epub
• RSS and ATOM Feeds
• True Type Fonts
• HTML
• Images – JPEG, GIF, PNG, TIFF, Bitmap (including EXIF where found)
• iWork (Keynote, Pages etc)
• RFC822 mbox Mail

10

Alfresco 3.4 - Supported Formats – Page 2

File Formats supported out of the box, Page 2

• Microsoft Outlook .msg Email
• Microsoft Office (Binary) – Word, PowerPoint, Excel, Visio, Publisher, Works
• Microsoft Office (OOXML) – Word, PowerPoint, Excel
• MP3 (id3 v1 and v2)
• CDF (Scientific Data)
• Open Document Format (Open Office)
• Old-style Open Office (.sxw etc)
• PDF

11

Alfresco 3.4 - Supported Formats – Page 3

File Formats supported out of the box, Page 3

• Zip and Tar archives
• RDF
• Plain Text
• FLV Video
• XML
• Java class files

And I probably forgot one...!

12

Calling Apache Tika

// Get a content detector, and an auto-selecting Parser
TikaConfig config = TikaConfig.getDefaultConfig();
ContainerAwareDetector detector = new ContainerAwareDetector(
    config.getMimeRepository()
);
Parser parser = new AutoDetectParser(detector);

// We'll only want the plain text contents
ContentHandler handler = new BodyContentHandler();

// Tell the parser what we have
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);

// Have it processed
parser.parse(input, handler, metadata, new ParseContext());

13

Metadata Extractor – Java Use

// Look up the registry bean, then the extractor for the source mime type
MetadataExtracterRegistry registry = (MetadataExtracterRegistry)context.getBean("metadataExtracterRegistry");
MetadataExtracter extractor = registry.getExtracter("application/vnd.ms-excel");

// Extract into a map of content model properties
Map<QName, Serializable> properties = new HashMap<QName, Serializable>();
ContentReader reader = contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
extractor.extract(reader, properties);
System.err.println(properties);

14

Metadata Extractor – JavaScript Use

JavaScript

var action = actions.create("extract-metadata");

action.execute(document);

• Full access is not directly available

• You can’t get at the raw properties

• You can, however, trigger extraction and saving to the node easily

• Available via an action

15

Metadata Extractor – Geo Content Model

<aspect name="cm:geographic">
   <title>Geographic</title>
   <properties>
      <property name="cm:latitude">
         <title>Latitude</title>
         <type>d:double</type>
      </property>
      <property name="cm:longitude">
         <title>Longitude</title>
         <type>d:double</type>
      </property>
   </properties>
</aspect>

16

Metadata Extractor – Geo Mapping

# Namespaces
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0

# Geo Mappings
geo\:lat=cm:latitude
geo\:long=cm:longitude

# Normal Mappings
author=cm:author
title=cm:title
description=cm:description
created=cm:created
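To make the mapping file concrete, here is a minimal sketch of how such a properties file turns raw extracted keys into content model property names. It uses only `java.util.Properties`; the class `GeoMappingSketch`, its methods and the sample values are hypothetical, and dropping unmapped keys is an assumption of this sketch rather than a statement about the extractor's policy. Note that the `\:` in `geo\:lat` is the properties-file escape for a literal colon in a key.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical sketch of applying an extractor mapping file; not the Alfresco API. */
class GeoMappingSketch {
    /** Parses mapping text in java.util.Properties syntax. */
    static Properties loadMapping(String text) {
        Properties mapping = new Properties();
        try {
            mapping.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return mapping;
    }

    /** Renames raw metadata keys to mapped property names; unmapped keys are dropped in this sketch. */
    static Map<String, String> applyMapping(Properties mapping, Map<String, String> raw) {
        Map<String, String> mapped = new HashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String target = mapping.getProperty(entry.getKey());
            if (target != null) {
                mapped.put(target, entry.getValue());
            }
        }
        return mapped;
    }

    public static void main(String[] args) {
        // "\\:" in Java source is "\:" in the file, which escapes the colon in the key
        Properties mapping = loadMapping(
            "geo\\:lat=cm:latitude\n" +
            "geo\\:long=cm:longitude\n" +
            "title=cm:title\n");

        Map<String, String> raw = new HashMap<>();
        raw.put("geo:lat", "51.75");      // sample values, invented for illustration
        raw.put("geo:long", "-1.26");
        raw.put("producer", "SomeCamera"); // no mapping, so it is not carried over

        System.out.println(applyMapping(mapping, raw));
    }
}
```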

17

Demo: Geo Tagged Image in Share

18

Content Transformation Service

19

Supported Transformations

Transformations Supported in Alfresco v3.4

• Plain Text, HTML & XHTML for all Apache Tika supported text and document formats (around 30 file formats)
• PDF to Image
• PDF to SWF (for preview)
• Office File Formats to PDF (via Open Office, using JODConverter in Enterprise)
• Plain Text and XML to PDF
• Zip listing to Text
• Image to other Images (via ImageMagick)

20

Content Transformer and Tika

Handlers

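// Option 1: plain text. Pass the handler to parser.parse(...), then read the text
ContentHandler handler = new BodyContentHandler();
String text = handler.toString();

// Option 2: XML. Serialise the SAX events through a TransformerHandler
// (newInstance() returns a TransformerFactory, so a cast is needed)
SAXTransformerFactory factory = (SAXTransformerFactory)SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
StringWriter sw = new StringWriter();
handler.setResult(new StreamResult(sw));
// Pass the handler to parser.parse(...), then read the output
String text = sw.toString();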

• Tika generates HTML-like SAX events as it parses
• Uses Java SAX API
• Events can be captured or transformed
• Body Content Handler used for plain text
• HTML and XHTML available
• Can customise with your own handler, with XSLT or with E4X from JavaScript
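Writing "your own handler" can be sketched with the JDK alone. In the example below the JDK's SAX parser stands in for a Tika parser as the source of HTML-like events, and a small custom handler keeps only the text inside the body. The class `BodyTextSketch` is hypothetical; it imitates the idea behind the Body Content Handler and is not Tika's implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical custom SAX handler that collects the text found inside <body>. */
class BodyTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // 0 while outside <body>, otherwise the nesting depth inside it

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    /** Feeds XHTML through the JDK SAX parser (standing in for a Tika parser) into the handler. */
    static String extract(String xhtml) {
        try {
            BodyTextSketch handler = new BodyTextSketch();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Report</title></head>"
                     + "<body><h1>Hello</h1><p>World</p></body></html>";
        System.out.println(extract(xhtml));
    }
}
```

With Tika, the same handler object would be passed to `parser.parse(...)` instead of the JDK parser; the title text in the head is skipped because the events arrive before the body starts.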

21

Content Transformer – Java Use

// Look up the registry bean, then a transformer for the source and target mime types
ContentTransformerRegistry registry = (ContentTransformerRegistry)context.getBean("contentTransformerRegistry");
ContentTransformer transformer = registry.getTransformer("application/vnd.ms-excel", "text/csv", new TransformationOptions());

// Read from the source node, write to the destination node
ContentReader reader = contentService.getReader(sourceNodeRef, ContentModel.PROP_CONTENT);
ContentWriter writer = contentService.getWriter(destNodeRef, ContentModel.PROP_CONTENT, true);
transformer.transform(reader, writer);

22

Content Transformer – JavaScript Use

JavaScript

var action = actions.create("transform");

// Transform into the same folder

action.parameters["destination-folder"] = document.parent;

action.parameters["assoc-type"] = "{http://www.alfresco.org/model/content/1.0}contains";

action.parameters["assoc-name"] = document.name + "transformed";

action.parameters["mime-type"] = "text/html";

// Execute

action.execute(document);

• Full access is not directly available

• You can’t control which property is transformed, it’s always Content

• You can control where the transformed version goes

• Triggering the transformation is easier than in Java

• Available via an action

23

Custom Tika Parsers - Interface

Interface

public interface Parser {
    Set<MediaType> getSupportedTypes(ParseContext context);

    void parse(InputStream stream, ContentHandler handler,
               Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException;
}

• The Tika Parser interface is quite simple

• Need to provide a list of supported mime types, so that auto-detection can work

• Accept an input stream, populate the Metadata object, and fire SAX events to the supplied handler

• That’s it!

24

Custom Tika Parser – Hello World Parser

public class HelloWorldParser implements Parser {
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        Set<MediaType> types = new HashSet<MediaType>();
        types.add(MediaType.parse("hello/world"));
        return types;
    }

    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context) throws SAXException {
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.startElement("h1");
        xhtml.characters("Hello, World!");
        xhtml.endElement("h1");
        xhtml.endDocument();

        metadata.set("hello","world");
        metadata.set("title","Hello World!");
    }
}

25

Custom Command Line Transformer

<bean id="transformer.worker.helloWorldCMD"
      class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
  <property name="mimetypeService"><ref bean="mimetypeService"/></property>
  <property name="transformCommand">
    <bean class="org.alfresco.util.exec.RuntimeExec">
      <property name="commandsAndArguments"><map>
        <entry key=".*"><list>
          <value>/bin/bash</value>
          <value>-c</value>
          <value>/bin/echo 'Hello World - ${source}' &gt; ${target}</value>
        </list></entry>
      </map></property>
      <property name="errorCodes"><value>1,127</value></property>
    </bean>
  </property>
  <property name="explicitTransformations">
    <list><bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
      <property name="sourceMimetype"><value>text/plain</value></property>
      <property name="targetMimetype"><value>hello/world</value></property>
    </bean></list>
  </property>
</bean>

<bean id="transformer.helloWorldCMD"
      class="org.alfresco.repo.content.transform.ProxyContentTransformer"
      parent="baseContentTransformer">
  <property name="worker"><ref bean="transformer.worker.helloWorldCMD"/></property>
</bean>

26

Custom Transformer – Demo

JS Code

var action = actions.create("transform");
action.parameters["destination-folder"] = document.parent;
action.parameters["assoc-type"] = "{http://www.alfresco.org/model/content/1.0}contains";
action.parameters["assoc-name"] = document.name + "HW";

if(document.mimetype == "hello/world") {
   action.parameters["mime-type"] = "text/plain";
} else {
   action.parameters["mime-type"] = "hello/world";
}

action.execute(document);

• Use our Command Line transformer to generate a “hello/world” version

• Use our Tika transformer to turn this back into plain text

• Uses the JavaScript API to access the content transformation service

27

Demo 2: Excel to Plain Text, CSV and HTML

28

Rendition Service

29

Standard Rendition Engines

Renditions Supported in Alfresco v3.4

• reformat – access to the Content Transformation Service
• image – crop, resize, etc
• freemarker – runs a Freemarker Template against the content of the node
• html – turns .docx files into clean HTML + images
• xslt – runs an XSLT Transformation against the content of the node (XML content nodes only!)
• composite – executes several renditions in series, eg reformat followed by image crop

30

Persisted vs Transient Definitions

For your more complicated renditions

• To run a rendition, first create a rendition definition for a given rendering engine
• Then set all the parameters against it
• Finally execute it against a node

• For very complicated / common renditions, you can save the definition to the data dictionary
• It can then be retrieved and run
• Rendition Service provides support to create, load, save and execute definitions
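The create / set parameters / execute flow, and the difference between a transient and a persisted definition, can be sketched without Alfresco at all. Everything below is hypothetical: `RenditionSketch`, its `Definition` class, the "truncate" engine and its "length" parameter are invented for illustration, a plain string stands in for a node's content, and an in-memory map stands in for the data dictionary.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical sketch of rendition definitions; not the Alfresco RenditionService. */
class RenditionSketch {
    /** A definition names a rendering engine and carries the parameters for it. */
    static final class Definition {
        final String name;
        final String engine;
        final Map<String, Object> parameters = new HashMap<>();

        Definition(String name, String engine) {
            this.name = name;
            this.engine = engine;
        }
    }

    // engine name -> implementation taking (content, parameters)
    private final Map<String, BiFunction<String, Map<String, Object>, String>> engines = new HashMap<>();
    // stands in for definitions saved to the data dictionary
    private final Map<String, Definition> saved = new HashMap<>();

    void registerEngine(String engine, BiFunction<String, Map<String, Object>, String> impl) {
        engines.put(engine, impl);
    }

    void save(Definition definition) {
        saved.put(definition.name, definition);
    }

    /** Returns the saved definition, or null if none was saved under that name. */
    Definition load(String name) {
        return saved.get(name);
    }

    String render(String content, Definition definition) {
        BiFunction<String, Map<String, Object>, String> engine = engines.get(definition.engine);
        if (engine == null) {
            throw new IllegalArgumentException("Unknown engine: " + definition.engine);
        }
        return engine.apply(content, definition.parameters);
    }

    public static void main(String[] args) {
        RenditionSketch service = new RenditionSketch();
        service.registerEngine("truncate", (content, params) ->
            content.substring(0, Math.min(content.length(), (Integer) params.get("length"))));

        // Transient: create, set parameters, execute, then discard
        Definition oneOff = new Definition("oneOff", "truncate");
        oneOff.parameters.put("length", 5);
        System.out.println(service.render("Hello, World!", oneOff));

        // Persisted: save once, then load and run whenever needed
        Definition common = new Definition("summary", "truncate");
        common.parameters.put("length", 8);
        service.save(common);
        System.out.println(service.render("Hello, World!", service.load("summary")));
    }
}
```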

31

Rendition Service – Calling From Java

Load, Edit, Save, Run

// Retrieve the existing Rendition Definition
QName renditionName = QName.createQName(NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");
RenditionDefinition renditionDef = renditionService.loadRenditionDefinition(renditionName);

// Make some changes
renditionDef.setParameterValue(AbstractRenderingEngine.PARAM_MIME_TYPE, MimetypeMap.MIMETYPE_PDF);
renditionDef.setParameterValue(RenditionService.PARAM_ORPHAN_EXISTING_RENDITION, true);

// Persist the changes
renditionService.saveRenditionDefinition(renditionDef);

// Run the Rendition
ChildAssociationRef assoc = renditionService.render(sourceNode, renditionDef);

32

Rendition Service – Calling From JavaScript

Create, Run, List

// Create
var renditionDef = renditionService.createRenditionDefinition("cm:cropResize", "imageRenderingEngine");
renditionDef.parameters["destination-path-template"] = "/Company Home/Cropped Images/${name}.jpg";
renditionDef.parameters["isAbsolute"] = true;
renditionDef.parameters["xsize"] = 50;
renditionDef.parameters["ysize"] = 50;

// Run
renditionService.render(nodeRef, renditionDef);

// List
var renditions = renditionService.getRenditions(nodeRef);

33

Rendition Service – More Calling Options

Actions, Rules, CMIS

• Renditions are Actions, but normally hidden ones
• They won’t show up in Share when defining Rules, or in Explorer for running a Custom Action

• Solution – create a JS Script, or some custom Java
• Use this from your Rule / to run as an Action

• No dedicated REST API, but Renditions are available through CMIS
• More details available in the CMIS talks!

34

Custom Rendition Engines

When a composite just isn’t enough

• Rendition Engines are a special kind of Action Executor
• This delivers lots of flexibility, and means anyone who can write Custom Actions already knows enough to write Custom Rendition Engines!
• org.alfresco.repo.rendition.executer.AbstractRenderingEngine provides a helpful superclass

• To learn more about Custom Actions and Custom Action Executors, see Neil McErlean’s talk

35

Demo 1: Crop and Resize an Image

(Using Share Rules)

36

Demo 2: Video Rendition

37

Demo 3: Word .docx -> HTML & Images

(Using Web Quick Start)

38

Any Questions?

39

Learn More

• wiki.alfresco.com
• forums.alfresco.com
• blogs.alfresco.com/wp/nickb/
• twitter: @AlfrescoECM @Gagravarr
