Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

João F. Pimentel, Saumen Dey, Timothy McPhillips, Khalid Belhajjame, David Koop, Leonardo Murta, Vanessa Braganholo, Bertram Ludäscher

Upload: bertram-ludaescher

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

João F. Pimentel, Saumen Dey, Timothy McPhillips, Khalid Belhajjame, David Koop, Leonardo Murta, Vanessa Braganholo, Bertram Ludäscher

Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

Page 2: Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

Provenance of the Yin & Yang Demo

Dagstuhl’16 Reproducibility Seminar

Provenance-Week’16 Demo!

was_derived_from__via

TaPP’15, Edinburgh

Page 3: Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

Using Provenance from Script Runs

Example from the log-file: 2016-06-07 20:32:36 Wrote run/data/DRT240/DRT240_11000eV_002.img

But how was that image derived?? (“Provenance for Self!”)

Page 4: Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

[Figure: full noWorkflow provenance graph of one run of the simulate_data_collection script. Nodes are function calls and variable assignments labeled with script line numbers (for example 50 spreadsheet_rows, 61 calculate_strategy, 92 collect_next_image, 106 transform_image, 121 writer.writerow). File nodes include cassette_q55_spreadsheet.csv, calibration.img, run/run_log.txt, run/rejected_samples.txt, run/collected_images.csv, the raw images under run/raw/q55/DRT240/ and run/raw/q55/DRT322/, and the corrected images under run/data/DRT240/ and run/data/DRT322/.]

noWorkflow: not only Workflow!

•  Scripts have provenance, too!

•  Transparently capture some/all provenance from Python script runs.

•  Use filter queries to “zoom” into relevant parts..
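
To give a feel for what capturing provenance from a Python script run involves, here is a minimal sketch built on the standard `sys.setprofile` hook. It is not how noWorkflow is implemented and it records far less. The names `collect_next_image` and `transform_image` are borrowed from the trace graph, but their bodies, the helper `capture_calls`, and the file names are hypothetical stand-ins.

```python
import sys


def capture_calls(func, *args, **kwargs):
    """Run func and return (result, [(function name, first line number), ...])."""
    calls = []

    def profiler(frame, event, arg):
        # The "call" event fires each time a Python-level function is entered.
        if event == "call":
            calls.append((frame.f_code.co_name, frame.f_code.co_firstlineno))

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)  # always uninstall the hook
    return result, calls


# Hypothetical stand-ins for two steps of the demo script.
def collect_next_image(energy, frame_number):
    return "image_{:03d}_{}eV.raw".format(frame_number, energy)


def transform_image(raw_image_file):
    return raw_image_file.replace(".raw", ".img")


def simulate():
    raw_image_file = collect_next_image(11000, 2)
    return transform_image(raw_image_file)


result, calls = capture_calls(simulate)
names = [name for name, _ in calls]
print(names)
```

The script itself is left unchanged; the hook observes it from outside, which is the sense of "transparently" in the bullet above.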

Page 5: Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

[Figure: filtered noWorkflow lineage graph for run/data/DRT240/DRT240_11000eV_002.img within simulate_data_collection. The graph traces the file back through 106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img') and calibration.img; 92 collect_next_image with energy = 11000, frame_number = 2, raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw'; 91 sample_id = 'DRT240'; 61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000]) with accepted_sample = 'DRT240', num_images = 2, energies = [10000, 11000, 12000]; 50 spreadsheet_rows(sample_spreadsheet_file) with sample_name = 'DRT240', sample_quality = 45; 49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv'; 24 cassette_id = 'q55', sample_score_cutoff = 12.0, data_redundancy = 0.0, calibration_image_file = 'calibration.img'; and the option parsing on lines 230 and 251 (args = ['q55']).]

$ now dataflow -f "run/data/DRT240/DRT240_11000eV_002.img"

$(NW_FILTERED_LINEAGE_GRAPH).gv: $(NW_FACTS)
	now helper df_style.py
	now dataflow -v 55 -f $(RETROSPECTIVE_LINEAGE_VALUE) -m simulation | python df_style.py -d BT -e > $(NW_FILTERED_LINEAGE_GRAPH).gv

..auto-“make” this!

noWorkflow lineage of an image file

Provenance information about Python function calls, variable assignments, etc.

Page 6: Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

[Figure: YesWorkflow graph of simulate_data_collection. Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity. Data: run_log, sample_name, sample_quality, accepted_sample, rejected_sample, num_images, energies, rejection_log, sample_id, energy, frame_number, raw_image, corrected_image, total_intensity, pixel_count, collection_log. Workflow inputs: sample_spreadsheet, calibration_image, sample_score_cutoff, data_redundancy, cassette_id.]

YesWorkflow: Yes, scripts are Workflows, too!

•  Use YW annotations @begin...@end, @in, @out to reveal hidden conceptual workflow (prospective provenance)

•  Script isn't changed:
   –  annotations via comments (=> language independent)

•  For understanding and sharing the “big picture”

•  Query and visualize!
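
As an illustration of the annotations named above, the following sketch marks up a small Python function with `@begin`, `@end`, `@in`, and `@out` comments. The block and data names come from the workflow graph on this slide and the sample values from the noWorkflow trace on page 5; the acceptance rule in the function body is a placeholder assumption, not the demo script's actual logic.

```python
# @begin simulate_data_collection
# @in sample_name
# @in sample_quality
# @in sample_score_cutoff
# @out accepted_sample
# @out rejected_sample


def simulate_data_collection(sample_name, sample_quality, sample_score_cutoff):
    # @begin calculate_strategy
    # @in sample_name
    # @in sample_quality
    # @in sample_score_cutoff
    # @out accepted_sample
    # @out rejected_sample

    # Placeholder rule (an assumption): accept a sample whose quality
    # reaches the cutoff, otherwise reject it.
    if sample_quality >= sample_score_cutoff:
        accepted_sample, rejected_sample = sample_name, None
    else:
        accepted_sample, rejected_sample = None, sample_name

    # @end calculate_strategy
    return accepted_sample, rejected_sample


# @end simulate_data_collection

accepted, rejected = simulate_data_collection("DRT240", 45, 12.0)
print(accepted, rejected)
```

Because the annotations live entirely in comments, Python ignores them and the script runs exactly as before; only a tool that reads the comments sees the workflow structure.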

Page 7: Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

Alternate YW Views

[Figure: three renderings of the simulate_data_collection workflow. Process view: steps only (initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity). Data view. Workflow view: steps together with their data, as on page 6.]

Page 8: Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

[Figure: the YesWorkflow graph of simulate_data_collection, as on page 6.]

What is the lineage of “corrected_image”?

From here on “upwards”: What led (leads) to this?

..and what is irrelevant and should be pruned??

Page 9: Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

[Figure: subgraph of the simulate_data_collection workflow that remains after the lineage query. Steps: load_screening_results, calculate_strategy, collect_data_set, transform_images. Data: sample_name, sample_quality, accepted_sample, num_images, energies, sample_id, energy, frame_number, raw_image, corrected_image. Workflow inputs: sample_spreadsheet, calibration_image, sample_score_cutoff, data_redundancy, cassette_id.]

Subgraph resulting from lineage query on YW workflow model

What is the lineage of corrected_image?
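
A lineage query of this kind amounts to an upstream traversal of the workflow graph. The sketch below is a hedged illustration in Python, not the demo's own query: the step and data names are taken from the slides, but the wiring in `WORKFLOW` (which data each step reads and writes) is an assumption reconstructed from those names, and the helper `lineage` is hypothetical.

```python
from collections import deque

# step -> (inputs, outputs). The wiring is an assumption, not read off the slides.
WORKFLOW = {
    "initialize_run": ([], ["run_log"]),
    "load_screening_results": (["cassette_id", "sample_spreadsheet"],
                               ["sample_name", "sample_quality"]),
    "calculate_strategy": (["sample_name", "sample_quality",
                            "sample_score_cutoff", "data_redundancy"],
                           ["accepted_sample", "rejected_sample",
                            "num_images", "energies"]),
    "log_rejected_sample": (["cassette_id", "rejected_sample"], ["rejection_log"]),
    "collect_data_set": (["cassette_id", "accepted_sample", "num_images", "energies"],
                         ["sample_id", "energy", "frame_number", "raw_image"]),
    "transform_images": (["sample_id", "energy", "frame_number", "raw_image",
                          "calibration_image"],
                         ["corrected_image", "total_intensity", "pixel_count"]),
    "log_average_image_intensity": (["cassette_id", "sample_id", "frame_number",
                                     "total_intensity", "pixel_count"],
                                    ["collection_log"]),
}


def lineage(target, workflow):
    """Return the target plus every step and data item upstream of it."""
    produced_by = {data: step
                   for step, (_, outputs) in workflow.items()
                   for data in outputs}
    seen = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if node in workflow:
            upstream = workflow[node][0]      # a step depends on its inputs
        elif node in produced_by:
            upstream = [produced_by[node]]    # data depends on its producing step
        else:
            upstream = []                     # a workflow input has no ancestors
        for item in upstream:
            if item not in seen:
                seen.add(item)
                queue.append(item)
    return seen


corrected_image_lineage = lineage("corrected_image", WORKFLOW)
print(sorted(corrected_image_lineage))
```

Under the assumed wiring, everything that only feeds the logs (run_log, rejection_log, collection_log and their steps) falls outside the result, which matches the pruning shown in the subgraph.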

Page 10: Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

[Figure: the two provenance models shown together. YesWorkflow side: the full workflow graph of simulate_data_collection (page 6) and the corrected_image subgraph (page 9), connected by a lineage query. noWorkflow side: the full trace graph of the script run (page 4) and the filtered lineage graph for run/data/DRT240/DRT240_11000eV_002.img (page 5), connected by a lineage query.]

YesWorkflow: Conceptual workflow model

noWorkflow: Python trace model

But how do we bridge this gap???

Would like to use YW model to query NW data!

Page 11: Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

We’re off to see the Wizard of Prov...

We're off to see the Wizard, The wonderful Wizard of Prov!

-- We hear he is a wiz of a wiz

If ever a wiz there was. --

If ever, oh ever, a wiz there was, The Wizard of Prov is one because,

Because, because, because, because, because, Because of the wonderful things he does.

•  Enrich YW conceptual view with NW Python provenance!

•  Get the best of both worlds!

•  How hard can it be to bridge YW and NW… (cf. TaPP’15 prototype)

Page 12: Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

Diamonds are forever

Bridges aren’t…

Page 13: Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

…new bridge-building can be stressful

…even if just painting over.

Page 14: Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

Habemus Pons! We’ve got the Bridge!

The bridge is the journey.. (The journey is the destination)

Lineage of image file in terms of YW model, with details from NW provenance
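
One way to picture the bridge is as a join between YW data names and the values noWorkflow recorded for the corresponding Python variables. The sketch below is only an illustration in Python and not the demo's implementation, which the last slide describes as built with make, docker, Prolog, SQL, and Graphviz. The trace values are copied from the lineage graph on page 5; the alias table `YW_TO_NW` and the helper `enrich` are assumptions.

```python
# (script line, variable, value) as shown in the noWorkflow lineage graph of
# run/data/DRT240/DRT240_11000eV_002.img on page 5.
NW_TRACE = [
    (24, "cassette_id", "q55"),
    (24, "sample_score_cutoff", 12.0),
    (24, "data_redundancy", 0.0),
    (24, "calibration_image_file", "calibration.img"),
    (50, "sample_name", "DRT240"),
    (50, "sample_quality", 45),
    (61, "accepted_sample", "DRT240"),
    (61, "num_images", 2),
    (61, "energies", [10000, 11000, 12000]),
    (91, "sample_id", "DRT240"),
    (92, "energy", 11000),
    (92, "frame_number", 2),
    (92, "raw_image_file", "run/raw/q55/DRT240/e11000/image_002.raw"),
]

# Assumption: YW data names that differ from the Python variable names.
YW_TO_NW = {
    "raw_image": "raw_image_file",
    "calibration_image": "calibration_image_file",
}


def enrich(yw_names, trace, aliases):
    """Map each YW data name to (line, value) from the trace, or None if absent."""
    recorded = {variable: (line, value) for line, variable, value in trace}
    return {name: recorded.get(aliases.get(name, name)) for name in yw_names}


yw_lineage_data = ["cassette_id", "sample_name", "sample_quality", "accepted_sample",
                   "num_images", "energies", "sample_id", "energy", "frame_number",
                   "raw_image", "calibration_image", "corrected_image"]

enriched = enrich(yw_lineage_data, NW_TRACE, YW_TO_NW)
for name, detail in enriched.items():
    print(name, detail)
```

The conceptual names keep the graph readable, while the attached line numbers and values answer the run-specific question of how one particular image file came to be.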

Page 15: Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

Secret Reproducible Sauce

•  Combining provenance information from noWorkflow and YesWorkflow

•  Using all the good stuff:
   –  make, docker, Prolog, SQL, Graphviz

•  Open source
   –  github.com/yesworkflow-org/yw-noworkflow
   –  github.com/gems-uff/yin-yang-demo

•  Have a closer look at the demo!