Performance improvements
Performance improvement in OpenOffice.org
PERFORMANCE IMPROVEMENTS IN CALC
Niklas Nebel, Sun Microsystems
Agenda
- Introduction and context
- Local optimizations
- Handling sheets separately
- DataPilot performance
- Load & save outlook
Introduction and Context
Performance work in all of OOo
- Performance project
- Big improvements from 3.0 to 3.2
- Start-up: Cold start of Writer 20% faster
- Writer load performance
- Comparable with MS Word 2007
- Impress load performance
- Comparable with MS PowerPoint 2007
- Calc performance
- Load and save: Up to twice as fast
- Recalculation: Up to 20 times faster (extreme case)
Local Optimizations
API Usage When Saving Text Cells
- Filter uses getFormula API method
- Single quote character added if text can be parsed as a number
- Unnecessary parsing step
- Can take up to 17% of CPU time
Querying the Document Null Date
- Internal representation: Days since the null date
- File format: XML Schema dates (ISO 8601)
- Utility method for conversion
- Queries the null date from the document
- Several UNO calls
- Querying once is enough
- 10% of CPU time if only date cells are used
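
The conversion described on this slide can be sketched as follows. This is only an illustration, not the OOo filter code: `Date`, `DateConverter`, `daysFromCivil` and `civilFromDays` are hypothetical names, and the null date 1899-12-30 used in the usage note is assumed to be the document default. The point is that the null date is passed in once when the converter is created, instead of being queried again for every date cell.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical calendar date type for this sketch (not an OOo type).
struct Date { int year; unsigned month; unsigned day; };

// Days between 1970-01-01 and the given proleptic Gregorian date.
long daysFromCivil(int y, unsigned m, unsigned d) {
    assert(m >= 1 && m <= 12 && d >= 1 && d <= 31);
    y -= m <= 2;
    const long era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);
    const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + static_cast<long>(doe) - 719468;
}

// Inverse of daysFromCivil.
Date civilFromDays(long z) {
    z += 719468;
    const long era = (z >= 0 ? z : z - 146096) / 146097;
    const unsigned doe = static_cast<unsigned>(z - era * 146097);
    const unsigned yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
    const int y = static_cast<int>(yoe) + static_cast<int>(era) * 400;
    const unsigned doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    const unsigned mp = (5 * doy + 2) / 153;
    const unsigned d = doy - (153 * mp + 2) / 5 + 1;
    const unsigned m = mp < 10 ? mp + 3 : mp - 9;
    return Date{ y + (m <= 2 ? 1 : 0), m, d };
}

// Holds the null date, obtained once, and converts many cell values.
class DateConverter {
public:
    explicit DateConverter(const Date& nullDate)
        : nullDays_(daysFromCivil(nullDate.year, nullDate.month, nullDate.day)) {}

    // Cell value (days since the null date) to an ISO 8601 date string.
    std::string toIso(long serial) const {
        const Date d = civilFromDays(nullDays_ + serial);
        char buf[32];
        std::snprintf(buf, sizeof buf, "%04d-%02u-%02u", d.year, d.month, d.day);
        return std::string(buf);
    }

private:
    long nullDays_;  // cached once instead of queried per cell
};
```

With a null date of 1899-12-30, `DateConverter(Date{1899, 12, 30}).toIso(2)` yields `1900-01-01`. A real export filter would obtain the null date from the document once, before iterating over the cells.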
Collecting Formatted Cell Ranges
- Collect cell ranges with equal cell formats
- For generating automatic styles
- Keep a list of ranges for each set of formats
- Try to join adjacent ranges
- Formats are kept and iterated column-wise
- Can use this information when trying to join
- Prevents pathological cases
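
One way the joining step could use the column-wise order is sketched below. This is an assumption about the idea, not the actual implementation: `CellRange` and `RangeCollector` are hypothetical names, and one collector instance stands for the range list of a single set of formats. Because runs arrive column by column, a new run can only extend a range that ended in the previous column and covers the same rows, so a lookup by row span replaces a scan over the whole list.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical rectangular cell range (zero-based, inclusive).
struct CellRange { int startCol, startRow, endCol, endRow; };

// Collects the ranges for ONE set of formats.
class RangeCollector {
public:
    // Add a vertical run of equally formatted cells. Runs must arrive
    // in column-wise order (non-decreasing column).
    void addRun(int col, int startRow, int endRow) {
        assert(startRow <= endRow);
        assert(col >= lastCol_);
        lastCol_ = col;

        const std::pair<int, int> rows(startRow, endRow);
        std::map<std::pair<int, int>, std::size_t>::iterator it = open_.find(rows);
        if (it != open_.end() && ranges_[it->second].endCol == col - 1) {
            // Same rows, directly adjacent column: widen the existing range.
            ranges_[it->second].endCol = col;
            return;
        }
        // Otherwise start a new range and remember it for this row span.
        ranges_.push_back(CellRange{col, startRow, col, endRow});
        open_[rows] = ranges_.size() - 1;
    }

    const std::vector<CellRange>& ranges() const { return ranges_; }

private:
    std::vector<CellRange> ranges_;
    std::map<std::pair<int, int>, std::size_t> open_;  // row span -> latest range
    int lastCol_ = 0;
};
```

Each call costs one map lookup, independent of how many ranges have already been collected, which is the kind of behavior that avoids the pathological cases mentioned above.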
Formula Optimizations
- String handling when formulas are parsed
- Functions, references, names are case-insensitive
- Operators, separators, parentheses are not
- Reduce case conversion calls
- 5% of CPU time saved
- Sorting of values for MEDIAN etc.
- Not necessary to completely sort the array
- Use the std::nth_element STL algorithm instead
- Faster calculation after loading
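
A minimal sketch of the MEDIAN idea, assuming a plain vector of doubles (the function name `median` and its signature are illustrative, not taken from the Calc interpreter). `std::nth_element` only guarantees that the element at the chosen position is the one a full sort would put there, with smaller-or-equal values before it, which is all a median needs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median via partial ordering instead of a full sort.
// Takes the vector by value because nth_element reorders it.
double median(std::vector<double> values) {
    assert(!values.empty());
    const std::size_t n = values.size();
    const std::size_t mid = n / 2;

    // After this call values[mid] is the element a full sort would place there.
    std::nth_element(values.begin(), values.begin() + mid, values.end());
    const double upper = values[mid];
    if (n % 2 == 1)
        return upper;

    // Even count: the lower middle value is the largest element left of mid.
    const double lower = *std::max_element(values.begin(), values.begin() + mid);
    return (lower + upper) / 2.0;
}
```

`std::nth_element` runs in linear time on average, compared with O(n log n) for a complete sort.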
Formula Recalculation (1)
- Detection of duplicate notifications
- When a cell range is modified
- Parameter range can contain several changed cells
- Notify each range only once
- Also useful for single-cell change
- Parameter range can contain several changed results
- Extreme case: Issue 95967 20x faster
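
The "notify each range only once" idea can be illustrated with a small sketch. The types `Cell` and `Listener` and the function `notifyOnce` are hypothetical and much simpler than Calc's broadcaster structures; the sketch only shows how collecting the affected listeners in a set first removes the duplicate notifications.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

struct Cell { int col, row; };

// Hypothetical listener: a formula whose parameter is a cell range.
struct Listener {
    int startCol, startRow, endCol, endRow;
    int notifications;

    Listener(int c1, int r1, int c2, int r2)
        : startCol(c1), startRow(r1), endCol(c2), endRow(r2), notifications(0) {
        assert(c1 <= c2 && r1 <= r2);
    }
    bool covers(const Cell& c) const {
        return c.col >= startCol && c.col <= endCol &&
               c.row >= startRow && c.row <= endRow;
    }
};

// Notify every affected listener exactly once, however many of the
// changed cells fall inside its range. Returns the number notified.
std::size_t notifyOnce(const std::vector<Cell>& changed,
                       std::vector<Listener>& listeners) {
    std::set<Listener*> pending;  // a set drops duplicate entries
    for (const Cell& c : changed)
        for (Listener& l : listeners)
            if (l.covers(c))
                pending.insert(&l);

    for (Listener* l : pending)
        ++l->notifications;  // stands in for the real notification call
    return pending.size();
}
```

Without the set, a formula such as a sum over A1:A10 would be notified three times when A1, A2 and A3 change together; with it, once.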
Handling Sheets Separately
Updating Row Heights
- Optimal row height depends on local conditions
- Especially fonts
- Core structures need concrete height values
- Screen output: Only single sheet
- Positioning of shapes: Whole file
- File format: relative to cell position
- Internally: absolute positions
- Update row heights
- After loading: Visible sheet and sheets with shapes
- Others as needed (display, printing, ...)
Updating Row Heights: Comments
- Cell comments (formerly: notes) are shapes
- Often used in large sheets
- Usually not shown
- Create shape only when comment is shown
- Saves time if there are many hidden comments
- Row heights can be updated later
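
Creating the shape only when the comment is shown is a form of lazy initialization. The sketch below uses hypothetical types (`CommentShape`, `CellComment`) rather than the real drawing-layer classes: the comment keeps only its text until it is first displayed.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical drawing object; in a real document this would be the
// expensive part that needs absolute positions and row heights.
struct CommentShape { std::string text; };

class CellComment {
public:
    explicit CellComment(std::string text) : text_(std::move(text)) {}

    bool hasShape() const { return shape_ != nullptr; }

    // Called when the comment becomes visible: create the shape on first use.
    const CommentShape& show() {
        if (!shape_)
            shape_.reset(new CommentShape{text_});
        assert(shape_);
        return *shape_;
    }

private:
    std::string text_;                     // cheap: always kept
    std::unique_ptr<CommentShape> shape_;  // created lazily
};
```

A sheet with many hidden comments then creates no shapes at load time, so its row heights do not have to be known immediately.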
Updating Row Heights: Results
- No effect for single sheet
- Little improvement for text and numbers
- 30% CPU time with date cells on many sheets
- Formula results don't have to be calculated
Partial Saving
- Don't generate XML elements for whole file
- Copy unchanged parts on stream level
- Could copy from temporary storage
- Storage layer creates copy of the unpacked file
- Access the original file
- Uncompress on the fly
- Cost
- File access: Read the compressed file
- CPU: Uncompress
Experiment: Incremental Saving
- Generate XML elements only for changed cells
- Proof of concept: Only single-cell changes
- No additional information kept after loading
- Minimal parsing to find affected cells in stream
- Takes extra time
- Less if affected cells near start of file
- Results (compared to 3.0):
- 40-70% improvement in CPU time
- 30-50% improvement in total time
Sheet-Wise Saving
- Handle sheets instead of individual cells
- Fewer sheets than cells
- Additional information can be kept in memory
- Easier to find modified sheets than modified cells
- One obvious limitation:
- Only useful with several sheets
Finding Modified Sheets
- Few code changes for most types of changes
- Formula notification for cell contents
- Formula calculation for changed results
- Cell format changes
- Column widths or row heights
- Handled separately: Print ranges, etc.
- Currently no handling of drawing layer changes
- All sheets are considered modified
Automatic Styles
- Direct formats are collected in automatic styles
- Referenced by name
- Generated name (ce1 etc.)
- One list for the whole document
- Have to be created with the same names again
- Implemented for cell contents (incl. comments)
- Keep a mapping of names to cell/text positions
- Collect styles for unchanged sheets first
- Include in existing duplicate detection for other sheets
- Sheets with shapes always saved normally
Putting the Parts Together
- When loading a file
- Compatibility checks: Namespaces, encoding
- Keep stream positions and style information
- Steps to save a spreadsheet document
- meta.xml, styles.xml, embedded objects: as usual
- content.xml
- Generate common content and modified sheets
- For each sheet: Generate or copy stream portion
- For Save and Save As update stream positions
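
The per-sheet "generate or copy" step for content.xml can be sketched roughly as below. All names (`SheetInfo`, `saveContent`, the `generateSheet` callback) are hypothetical, the original stream is modeled as a string already held in memory, and the common content and the automatic-style handling are left out. The sketch only shows the copy-or-generate decision and the update of the stream positions for the next save.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical per-sheet bookkeeping kept after loading.
struct SheetInfo {
    std::size_t start;   // offset of the sheet's XML in the loaded stream
    std::size_t length;  // length of that portion
    bool modified;       // set by the change tracking
};

// Build the sheet part of content.xml: copy unchanged sheets from the
// original stream, generate XML only for modified ones.
std::string saveContent(const std::string& original,
                        std::vector<SheetInfo>& sheets,
                        const std::function<std::string(std::size_t)>& generateSheet) {
    std::string out;
    for (std::size_t i = 0; i < sheets.size(); ++i) {
        SheetInfo& s = sheets[i];
        assert(s.start + s.length <= original.size());

        const std::string part = s.modified
            ? generateSheet(i)                      // full XML export
            : original.substr(s.start, s.length);   // plain stream copy

        // Update the stream positions so the next save can copy again.
        s.start = out.size();
        s.length = part.size();
        s.modified = false;
        out += part;
    }
    return out;
}
```

In this model a 16-sheet document with one modified sheet calls `generateSheet` once and copies the other 15 portions unchanged.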
Results
- Influencing factors
- Unchanged sheets
- Type of sheet content
- CPU time / file access
- Example
- Text, numbers, dates
- 16 sheets
- Single sheet modified
- Twice as fast
- On top of other changes
Formula Recalculation (2)
- Sheet area is divided into slots
- 16 columns by 128 rows
- Range dependency registered in all affected slots
- Needs attention when row limit is changed
- Change: Use hash_set instead of set
- Faster modification of dependency structures
- Loading time
- Change: Separate structures per sheet
- Faster recalculation if several sheets are used
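
The slot scheme can be illustrated with the following sketch. The slot size of 16 columns by 128 rows is taken from the slide; everything else is an assumption: the sheet limits (1024 columns, 65536 rows), the class name `SlotTable`, the use of integer listener ids, and `std::unordered_set` as the standard-library counterpart of the `hash_set` extension mentioned above. One `SlotTable` per sheet would correspond to the "separate structures per sheet" change.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

const int kSlotCols = 16;    // slot width, from the slide
const int kSlotRows = 128;   // slot height, from the slide
const int kMaxCols = 1024;   // assumed sheet limit
const int kMaxRows = 65536;  // assumed sheet limit
const int kSlotsPerBand = kMaxCols / kSlotCols;  // slots side by side

// Dependency slots for ONE sheet.
class SlotTable {
public:
    SlotTable()
        : slots_(static_cast<std::size_t>(kSlotsPerBand) * (kMaxRows / kSlotRows)) {}

    static std::size_t slotIndex(int col, int row) {
        assert(col >= 0 && col < kMaxCols && row >= 0 && row < kMaxRows);
        return static_cast<std::size_t>(row / kSlotRows) * kSlotsPerBand
             + static_cast<std::size_t>(col / kSlotCols);
    }

    // Register a range dependency in every slot the range touches.
    void registerRange(int listenerId, int c1, int r1, int c2, int r2) {
        assert(c1 <= c2 && r1 <= r2);
        for (int r = r1; r <= r2; r = (r / kSlotRows + 1) * kSlotRows)
            for (int c = c1; c <= c2; c = (c / kSlotCols + 1) * kSlotCols)
                slots_[slotIndex(c, r)].insert(listenerId);
    }

    // Candidate listeners for a changed cell: only one slot is examined.
    const std::unordered_set<int>& listenersAt(int col, int row) const {
        return slots_[slotIndex(col, row)];
    }

private:
    std::vector<std::unordered_set<int> > slots_;
};
```

The number of slots depends directly on `kMaxRows`, which is why a change of the row limit needs attention. A complete implementation would still compare the changed cell with each candidate's exact range, since a slot only narrows the search.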
DataPilot Performance
DataPilot Memory Usage
- Issue 55266: Several fields with many items
- Fix now under way from IBM Symphony team
- Don't allocate results for all child items
- New cache table
- CWS datapilotperf
- Planned for OOo 3.3
- Combination of large fields no longer a limitation
Load & Save Outlook
DOM Usage
- Prototype by Christian Lippka for Impress
- Use fast SAX to fill a compact DOM representation
- Import from DOM, possibly parallel to parsing
- Results for Impress
- Only 2% improvement for typical presentation
- Filling DOM tree uses 2% of CPU time
- Not worth the effort
- Calc may be different
- Larger number of XML elements
- But: Memory usage twice the XML stream size
Further Separation of Sheets
- Load only the visible sheet
- Load other sheets as needed, or in background
- Parse XML fragment from stream, or use DOM
- Formulas, charts may depend on changed cells
- Dependencies must be known before saving
- Parse formulas only as needed
- Per sheet or individually
- Already a separate step (but for all formulas)
- Handle several sheets in parallel
- More fine-grained locking needed
Q & A
PERFORMANCE IMPROVEMENTS IN CALC
Niklas Nebel [email_address]