yesworkflow: more provenance mileage from hybrid provenance models and queries
TRANSCRIPT
YesWorkflow: More Provenance Mileage from Hybrid Provenance Queries
Bertram Ludäscher
Director, Center for Informatics Research in Science & Scholarship (CIRSS), School of Information Sciences (iSchool@Illinois)
& National Center for Supercomputing Applications (NCSA) & Department of Computer Science (CS@Illinois)
Research-Showcase'16
Is the Price Right?!
• One of these has been sold for nearly $180 million!
• The other could be worth as much or more...
• Which is which and what is the difference?
Transparency, Reproducibility in Computational- and Data-Science
• What input data went into this study?
• What methods were used?
• … with what parameter settings, calibrations, …?
• Can we trust the data and methods?
§ Data Provenance (data lineage): origin and processing history of data → trust, transparency, data quality; audit trail; attribution, credit
Data Provenance: Useful but hard to come by…
Climate Change Impacts in the United States
U.S. National Climate Assessment, U.S. Global Change Research Program
A DataONE search (here: “grass”) yields different packages with provenance
YesWorkflow: Prospective and Retrospective Provenance for Scripts
• YW annotations (comments) in R, Python, MATLAB, … scripts reveal the hidden workflow
• Run graph queries on prospective-, retrospective-, and hybrid-provenance graphs
• Provenance: not only “for others”, but provenance for self!
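To make the first bullet concrete, here is a minimal sketch of what YW-annotated comments can look like in a Python script, together with a toy extractor that lists each declared block and its ports. The step and data names are taken from the workflow on this slide, but which input feeds which step is an assumption made for illustration. The `extract_blocks` helper is hypothetical: it is not the YesWorkflow tool, which handles much more (nested blocks, `@uri`, `@as`, graph rendering).

```python
import re

# A script fragment with YW annotations in ordinary comments.
# Names follow the slide; the wiring of inputs to steps is assumed,
# and the "..." lines stand in for the actual script code.
ANNOTATED_SCRIPT = '''
# @begin load_screening_results
# @in cassette_id
# @in sample_spreadsheet
# @out sample_name
# @out sample_quality
...
# @end load_screening_results

# @begin calculate_strategy
# @param sample_score_cutoff
# @in sample_name
# @in sample_quality
# @out accepted_sample
# @out rejected_sample
# @out num_images
# @out energies
...
# @end calculate_strategy
'''

ANNOTATION = re.compile(r"#\s*@(begin|end|in|out|param)\s+(\w+)")


def extract_blocks(source):
    """Hypothetical helper: collect inputs and outputs per @begin/@end block."""
    blocks, current = {}, None
    for line in source.splitlines():
        match = ANNOTATION.search(line)
        if not match:
            continue
        keyword, name = match.groups()
        if keyword == "begin":
            current = name
            blocks[current] = {"in": [], "out": []}
        elif keyword == "end":
            current = None
        elif current is not None:
            # Treat @param like @in: both are inputs of the block.
            port = "out" if keyword == "out" else "in"
            blocks[current][port].append(name)
    return blocks


blocks = extract_blocks(ANNOTATED_SCRIPT)
for name, ports in blocks.items():
    print(name, "<-", ports["in"], "->", ports["out"])
```

Because the annotations live in comments, the script runs exactly as before; the workflow view is recovered purely from the text of the file.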
[Figure: YesWorkflow graph recovered from the annotated data-collection script.
Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity.
Inputs: cassette_id, sample_score_cutoff, sample_spreadsheet (file:cassette_{cassette_id}_spreadsheet.csv), calibration_image (file:calibration.img).
Intermediate data: sample_name, sample_quality, rejected_sample, accepted_sample, num_images, energies, sample_id, energy, frame_number, total_intensity, pixel_count, corrected_image_path, raw_image (file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw).
Outputs: run_log (file:run/run_log.txt), rejection_log (file:/run/rejected_samples.txt), corrected_image (file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img), collection_log (file:run/collected_images.csv).]
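The URI templates in the graph above (for example `file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw`) suggest one way retrospective information can be recovered from the file system after a run: match the files a script left behind against the template and read the variable values off the path. The following is a minimal sketch of that idea. The `template_to_regex` helper and its matching rules (no `/` inside a variable, repeated variables must agree) are assumptions for illustration, not YesWorkflow's actual implementation.

```python
import re


def template_to_regex(template):
    """Hypothetical helper: compile a YW-style URI template into a regex."""
    if template.startswith("file:"):
        template = template[len("file:"):]
    pattern, seen = "", set()
    # re.split with a capturing group alternates literal text and variable names.
    for index, part in enumerate(re.split(r"\{(\w+)\}", template)):
        if index % 2 == 0:
            pattern += re.escape(part)          # literal path text
        elif part in seen:
            pattern += f"(?P={part})"           # repeated variable must match again
        else:
            seen.add(part)
            pattern += f"(?P<{part}>[^/]+?)"    # first occurrence: capture it
    return re.compile(pattern)


RAW_IMAGE_TEMPLATE = (
    "file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
)

raw_image = template_to_regex(RAW_IMAGE_TEMPLATE)
match = raw_image.fullmatch("run/raw/q55/DRT240/e10000/image_001.raw")
bindings = match.groupdict() if match else None
print(bindings)
```

The bindings come out as strings; converting `energy` or `frame_number` to numbers would be a further, script-specific step.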
[Figure: three provenance graphs for the C3_C4_map_present_NA script.]
Hybrid provenance graph
Steps: fetch_SYNMAP_land_cover_map_variable, fetch_monthly_mean_air_temperature_data, fetch_monthly_mean_precipitation_data, initialize_Grass_Matrix, examine_pixels_for_grass, generate_netcdf_file_for_C3_fraction, generate_netcdf_file_for_C4_fraction, generate_netcdf_file_for_Grass_fraction.
Intermediate data: lon_variable, lat_variable, lon_bnds_variable, lat_bnds_variable, Tair_Matrix, Rain_Matrix, Grass_variable, C3_Data, C4_Data.
Inputs:
[data7] SYNMAP_land_cover_map_data: inputs/land_cover/SYNMAP_NA_QD.nc
[data12] mean_airtemp: 12 monthly files, inputs/narr_air.2m_monthly/air.2m_monthly_2000_2010_mean.1.nc through air.2m_monthly_2000_2010_mean.12.nc
[data14] mean_precip: 12 monthly files, inputs/narr_apcp_rescaled_monthly/apcp_monthly_2000_2010_mean.1.nc through apcp_monthly_2000_2010_mean.12.nc
Outputs:
[data19] C3_fraction_data: outputs/SYNMAP_PRESENTVEG_C3Grass_RelaFrac_NA_v2.0.nc
[data20] C4_fraction_data: outputs/SYNMAP_PRESENTVEG_C4Grass_RelaFrac_NA_v2.0.nc
[data21] Grass_fraction_data: outputs/SYNMAP_PRESENTVEG_Grass_Fraction_NA_v2.0.nc
Provenance upstream of C3 output
Steps: examine_pixels_for_grass, fetch_SYNMAP_land_cover_map_variable, fetch_monthly_mean_precipitation_data, fetch_monthly_mean_air_temperature_data, generate_netcdf_file_for_C3_fraction. Inputs: [data7], [data12], [data14]. Output: [data19] C3_fraction_data.
Provenance upstream of Grass_fraction output
Steps: initialize_Grass_Matrix, fetch_SYNMAP_land_cover_map_variable, generate_netcdf_file_for_Grass_fraction. Input: [data7]. Output: [data21] Grass_fraction_data.
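The two "upstream" views are the result of a lineage query: start at an output and follow dependencies backwards. As a rough sketch of such a query, the graph can be stored as a "derived from" map and traversed. The node names below come from the figure; the individual edges are a simplified reading of it (for instance, only `lon_variable` and `lat_variable` are kept from the four coordinate variables) and should be treated as assumptions, as should the `upstream` helper itself.

```python
# Each key is directly derived from the nodes in its list.
# Edges are a simplified, assumed reading of the hybrid provenance graph.
DERIVED_FROM = {
    "fetch_SYNMAP_land_cover_map_variable": ["SYNMAP_land_cover_map_data"],
    "lon_variable": ["fetch_SYNMAP_land_cover_map_variable"],
    "lat_variable": ["fetch_SYNMAP_land_cover_map_variable"],
    "fetch_monthly_mean_air_temperature_data": ["mean_airtemp"],
    "Tair_Matrix": ["fetch_monthly_mean_air_temperature_data"],
    "fetch_monthly_mean_precipitation_data": ["mean_precip"],
    "Rain_Matrix": ["fetch_monthly_mean_precipitation_data"],
    "Grass_variable": ["initialize_Grass_Matrix"],
    "examine_pixels_for_grass": [
        "Tair_Matrix", "Rain_Matrix", "lon_variable", "lat_variable",
    ],
    "C3_Data": ["examine_pixels_for_grass"],
    "C4_Data": ["examine_pixels_for_grass"],
    "generate_netcdf_file_for_C3_fraction": [
        "C3_Data", "lon_variable", "lat_variable",
    ],
    "C3_fraction_data": ["generate_netcdf_file_for_C3_fraction"],
    "generate_netcdf_file_for_Grass_fraction": [
        "Grass_variable", "lon_variable", "lat_variable",
    ],
    "Grass_fraction_data": ["generate_netcdf_file_for_Grass_fraction"],
}


def upstream(node, derived_from):
    """Return every node reachable by following 'derived from' edges backwards."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


c3_lineage = upstream("C3_fraction_data", DERIVED_FROM)
grass_lineage = upstream("Grass_fraction_data", DERIVED_FROM)
print(sorted(c3_lineage))
print(sorted(grass_lineage))
```

Under these assumed edges, the C3 lineage reaches the temperature and precipitation inputs while the Grass_fraction lineage only reaches the land cover map, which matches the difference between the two upstream graphs.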
Demonstration
SUMMARY: Provenance for Reproducible Science
Example: Paleoclimate Reconstruction
R&D Opportunities:
• provenance models
• query languages
• system implementations
• … DB & KR & tools
[Figure: four graphs for the simulate_data_collection script; each full graph is linked to its lineage subgraph by an arrow labeled "lineage query".
(1) YesWorkflow graph of the full script. Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity. Inputs: sample_spreadsheet, calibration_image, sample_score_cutoff, data_redundancy, cassette_id. Outputs: run_log, rejection_log, corrected_image, collection_log.
(2) YesWorkflow lineage subgraph ending at corrected_image. Steps: load_screening_results, calculate_strategy, collect_data_set, transform_images.
(3) noWorkflow trace of one run (cassette q55, samples DRT240 and DRT322): function activations and variable assignments labeled by script line number (e.g. 50 spreadsheet_rows, 61 calculate_strategy, 92 collect_next_image, 106 transform_image, 121 writer.writerow), plus the files read and written: cassette_q55_spreadsheet.csv, calibration.img, run/run_log.txt, run/rejected_samples.txt, run/collected_images.csv, and raw/corrected image pairs such as run/raw/q55/DRT240/e10000/image_001.raw and run/data/DRT240/DRT240_10000eV_001.img.
(4) noWorkflow lineage subgraph ending at run/data/DRT240/DRT240_11000eV_002.img, with recorded values: cassette_id = 'q55', sample_score_cutoff = 12.0, data_redundancy = 0.0, calibration_image_file = 'calibration.img', sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv', sample_name = 'DRT240', sample_quality = 45, calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000]), accepted_sample = 'DRT240', num_images = 2, energies = [10000, 11000, 12000], sample_id = 'DRT240', energy = 11000, frame_number = 2, raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw', transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img').]
YesWorkflow: Conceptual workflow model
noWorkflow: Python trace model
How to bridge this gap?
Would like to use YW model to query NW data!
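One naive way to picture such a bridge is a join on names: take the data names declared in the YW model and look up the values that the NW trace recorded for variables of the same name. The sketch below does exactly that. The trace records are copied from the noWorkflow slide; the `ALIASES` table and the `values_for_step` helper are assumptions added for illustration and are not the approach presented in the talk.

```python
# (line, variable, value) records as shown on the noWorkflow trace slide.
NW_TRACE = [
    (24, "cassette_id", "q55"),
    (24, "sample_score_cutoff", 12.0),
    (50, "sample_name", "DRT240"),
    (50, "sample_quality", 45),
    (61, "accepted_sample", "DRT240"),
    (61, "num_images", 2),
    (61, "energies", [10000, 11000, 12000]),
    (91, "sample_id", "DRT240"),
    (92, "energy", 11000),
    (92, "frame_number", 2),
    (92, "raw_image_file", "run/raw/q55/DRT240/e11000/image_002.raw"),
]

# Data names produced by each step in the YW conceptual model.
YW_OUTPUTS = {
    "load_screening_results": ["sample_name", "sample_quality"],
    "calculate_strategy": [
        "accepted_sample", "rejected_sample", "num_images", "energies",
    ],
    "collect_data_set": ["sample_id", "energy", "frame_number", "raw_image"],
}

# Assumed alias table: YW data names and Python variable names need not agree.
ALIASES = {"raw_image": "raw_image_file"}


def values_for_step(step):
    """Hypothetical hybrid query: traced values for each YW output of a step."""
    result = {}
    for data_name in YW_OUTPUTS[step]:
        variable = ALIASES.get(data_name, data_name)
        result[data_name] = [
            value for _, name, value in NW_TRACE if name == variable
        ]
    return result


collected = values_for_step("collect_data_set")
strategy = values_for_step("calculate_strategy")
print(collected)
print(strategy)
```

Even this toy version shows where the gap lies: `raw_image` in the model only lines up with `raw_image_file` in the trace through a hand-written alias, and `rejected_sample` has no matching record in this excerpt of the trace.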