analysing large data sets using formal concept lattices


Post on 19-Jan-2016

DESCRIPTION

Analysing Large Data Sets using Formal Concept Lattices. Simon Andrews and Constantinos Orphanides, {s.andrews, c.orphanides}@shu.ac.uk. Conceptual Structures Research Group, Communication and Computing Research Centre.

TRANSCRIPT

Page 1: Analysing Large Data Sets using Formal Concept Lattices

Analysing Large Data Sets using Formal Concept Lattices

Simon Andrews and Constantinos Orphanides

{s.andrews, c.orphanides}@shu.ac.uk

Conceptual Structures Research Group
Communication and Computing Research Centre

Page 2: Analysing Large Data Sets using Formal Concept Lattices

Acknowledgement

This work is part of the CUBIST project ("Combining and Uniting Business Intelligence with Semantic Technologies"), funded by the European Commission's 7th Framework Programme of ICT, under topic 4.3: Intelligent Information Management.

Page 3: Analysing Large Data Sets using Formal Concept Lattices

Data Sets

• A variety of data sets can be converted into formal contexts:
  – Data Discretization
  – Data Booleanization
• However, issues arise:
  – Data of modest size can contain hundreds (or hundreds of thousands) of formal concepts, resulting in unmanageable and unreadable concept lattices.
  – Density of, and noise in, a context: factors that increase the number of formal concepts.
  – Computation of formal concepts cannot be carried out on a large scale by much of the existing software.
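As an illustration of booleanization, the sketch below turns each (attribute, value) pair of a categorical record into one Boolean formal attribute. The function name `booleanize`, the `name-value` naming scheme and the toy records are assumptions made for this example; they are not taken from FcaBedrock.

```python
from typing import Dict, List, Set


def booleanize(rows: List[Dict[str, str]]) -> Dict[int, Set[str]]:
    """Map each object (row index) to the set of formal attributes it has.

    Every (attribute, value) pair becomes one Boolean formal attribute
    named "attribute-value" (a naming scheme assumed for this sketch).
    """
    context: Dict[int, Set[str]] = {}
    for obj_id, row in enumerate(rows):
        context[obj_id] = {f"{name}-{value}" for name, value in row.items()}
    return context


# Hypothetical toy records, not taken from any data set in the slides.
rows = [
    {"habitat": "woods", "population": "several"},
    {"habitat": "grasses", "population": "several"},
]
context = booleanize(rows)
formal_attributes = set().union(*context.values())
```

Two categorical attributes with three distinct values between them yield three formal attributes here, which is why the number of formal attributes is usually larger than the number of columns in the original data.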

Page 4: Analysing Large Data Sets using Formal Concept Lattices

Tools By The Authors

• FcaBedrock (Formal Context Creator)
  – Creating sub-contexts by restricting the conversion of the data to information of interest.
• In-Close (Fast Concept Miner)
  – Removing relatively small concepts from a context to reduce "noise".

⇨ Production of readable, yet still meaningful, concept lattices.

Page 5: Analysing Large Data Sets using Formal Concept Lattices

FcaBedrock - Overview

• A Formal Context Creator for Formal Concept Analysis, developed by the authors.
• Free and open-source at Sourceforge.
• Input files supported: Flat-file CSV and Three-column CSV (triples).
• Output files supported: Burmeister (.cxt) and FIMI (.dat).
• User guided automation - the user has the final say on how to interpret a data set.
• Attributes supported: Categorical (aka many-valued, nominal/ordinal), Boolean and Continuous.
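For readers unfamiliar with the two output formats, the following sketch serialises a small context in each. It assumes the commonly described layouts: a Burmeister file starts with `B`, an optional name line, the object and attribute counts, a blank line, the object names, the attribute names, and then one row of `X`/`.` per object; a FIMI file has one line per object listing the integer ids of its attributes. The function names and the 0-based item numbering are assumptions for illustration, not FcaBedrock's own code.

```python
from typing import List


def to_burmeister(objects: List[str], attributes: List[str],
                  incidence: List[List[bool]]) -> str:
    """Serialise a context in the (assumed) Burmeister .cxt layout."""
    lines = ["B", "", str(len(objects)), str(len(attributes)), ""]
    lines += objects
    lines += attributes
    for row in incidence:
        # One character per attribute: X = object has it, . = it does not.
        lines.append("".join("X" if cell else "." for cell in row))
    return "\n".join(lines) + "\n"


def to_fimi(incidence: List[List[bool]]) -> str:
    """Serialise a context as FIMI-style transactions (0-based item ids assumed)."""
    rows = (" ".join(str(j) for j, cell in enumerate(row) if cell)
            for row in incidence)
    return "\n".join(rows) + "\n"


# Hypothetical two-object, three-attribute context.
objects = ["o1", "o2"]
attributes = ["a", "b", "c"]
incidence = [[True, False, True], [False, True, True]]
cxt_text = to_burmeister(objects, attributes, incidence)
fimi_text = to_fimi(incidence)
```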

Page 6: Analysing Large Data Sets using Formal Concept Lattices

FcaBedrock - Overview

• Auto-detection of metadata, directly from the data set, if desired.
• Support of both discrete (0-10, 10-20, …) and progressive (>10, >20, …) scaling for continuous attributes.
• Ability to exclude attributes from the analysis.
• Ability to restrict the analysis to user-specified attribute values.
• Metadata of each conversion/analysis saved & stored for subsequent conversions.
• Repetition of metadata for similar attributes.
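The difference between the two scaling styles for continuous attributes can be sketched as follows. Discrete scaling gives a value exactly one range attribute, while progressive scaling gives it every threshold attribute it exceeds. The function names, the attribute naming, and the half-open treatment of range boundaries are assumptions for this example; the slides do not say how FcaBedrock handles boundary values.

```python
from typing import List, Set


def discrete_scale(value: int, bounds: List[int], name: str) -> Set[str]:
    """Return the single range attribute [lo, hi) that contains the value."""
    return {f"{name}-{lo}-{hi}"
            for lo, hi in zip(bounds, bounds[1:])
            if lo <= value < hi}


def progressive_scale(value: int, thresholds: List[int], name: str) -> Set[str]:
    """Return one attribute for every threshold the value exceeds."""
    return {f"{name}>{t}" for t in thresholds if value > t}


# A hypothetical age of 25 under both schemes.
discrete = discrete_scale(25, [0, 10, 20, 30], "age")
progressive = progressive_scale(25, [10, 20, 30], "age")
```

With progressive scaling the attributes are nested (anything `>20` is also `>10`), so the resulting concepts are ordered along the scale instead of falling into disjoint ranges.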

Page 7: Analysing Large Data Sets using Formal Concept Lattices

In-Close - Overview

• A fast Concept Miner for Formal Concept Analysis, developed by one of the authors.
• Free and open-source at Sourceforge.
• Input files supported: Burmeister (.cxt).
• Minimum support for intent and extent.
• Output of analysis data and concepts.
• Output of sub-context ("noise" reduction).
• Fast computation of formal concepts:
  – Mining 1 million concepts per second.
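To make "mining formal concepts" concrete, here is a deliberately naive enumeration. It is not the In-Close algorithm and makes no attempt at speed: it builds every intent by intersecting object intents, then looks up the matching extent. All names and the toy context are assumptions for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

Concept = Tuple[FrozenSet[str], FrozenSet[str]]  # (extent, intent)


def naive_concepts(context: Dict[str, FrozenSet[str]]) -> List[Concept]:
    """Enumerate all formal concepts of a small context (naive, not In-Close)."""
    all_attributes = frozenset().union(*context.values())
    # Every intent is an intersection of object intents; the full attribute
    # set is the intent belonging to the empty intersection.
    intents = {all_attributes}
    for object_intent in context.values():
        intents |= {object_intent & intent for intent in intents}
    concepts = []
    for intent in intents:
        # The extent is every object that has all attributes of the intent.
        extent = frozenset(o for o, attrs in context.items() if intent <= attrs)
        concepts.append((extent, intent))
    return concepts


# Hypothetical two-object context.
context = {"o1": frozenset({"a", "b"}), "o2": frozenset({"b", "c"})}
found = naive_concepts(context)
```

Even this two-object context yields four concepts, which hints at how quickly the count grows with real data and why a dedicated miner is needed at scale.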

Page 8: Analysing Large Data Sets using Formal Concept Lattices

Analysis of Sub-Contexts: Agaricus-Lepiota

• Data Set: Agaricus-Lepiota (aka Mushroom)
  – From UCI Machine Learning Repository
  – 8124 objects (mushrooms)
  – 23 attributes (mushroom properties)
    • e.g. stalk shape, cap color, edible/poisonous…
  – Attribute types: Categorical, Boolean
  – Processed by In-Close: 220,000+ concepts

Page 9: Analysing Large Data Sets using Formal Concept Lattices

Analysis of Sub-Contexts: Agaricus-Lepiota

• Let us say we are interested in the relationship between mushroom habitat and population type.
• Using FcaBedrock:
  – Create a sub-context by converting only the habitat and population type attributes.

⇨ Down to 33 Formal Concepts (from 220,000+) and 13 Formal Attributes (from 125)

Page 10: Analysing Large Data Sets using Formal Concept Lattices

Visualisation of the sub-context in ConExp

Page 11: Analysing Large Data Sets using Formal Concept Lattices

Analysis of Sub-Contexts: Census Income

• Data Set: Census Income (aka Adult)
  – From UCI Machine Learning Repository
  – 32561 objects (adults)
  – 14 attributes (census data)
    • e.g. age, sex, education, employment type…
  – Attribute types: Categorical, Boolean, Continuous
  – Processed by In-Close: 100,000+ concepts

Page 12: Analysing Large Data Sets using Formal Concept Lattices

Analysis of Sub-Contexts: Census Income

• Let us say we are interested in comparing how pay is affected by gender in adults who have had a higher education.
• Using FcaBedrock:
  – Create a sub-context by converting only the sex, class and education attributes.
  – Convert only those objects (adults) with the education attribute value Bachelors, Masters or Doctorate.

⇨ Down to 7941 objects and 37 Formal Concepts
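The two restrictions used above (keep only some attributes, keep only objects with certain values) can be sketched on plain records as follows. The function `sub_context`, its parameters, and the toy rows are assumptions for illustration; FcaBedrock applies these restrictions through its own interface during conversion.

```python
from typing import Dict, List, Set


def sub_context(rows: List[Dict[str, str]],
                keep_attributes: List[str],
                allowed_values: Dict[str, Set[str]]) -> List[Dict[str, str]]:
    """Keep only the chosen columns, and only rows whose restricted
    columns hold one of the allowed values."""
    result = []
    for row in rows:
        if all(row[name] in values for name, values in allowed_values.items()):
            result.append({name: row[name] for name in keep_attributes})
    return result


# Hypothetical records in the style of the Census Income data set.
rows = [
    {"age": "39", "sex": "Male", "education": "Bachelors", "class": "<=50K"},
    {"age": "50", "sex": "Female", "education": "HS-grad", "class": "<=50K"},
    {"age": "31", "sex": "Female", "education": "Masters", "class": ">50K"},
]
restricted = sub_context(
    rows,
    keep_attributes=["sex", "class", "education"],
    allowed_values={"education": {"Bachelors", "Masters", "Doctorate"}},
)
```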

Page 13: Analysing Large Data Sets using Formal Concept Lattices

Visualisation of the sub-context in ConExp

Page 14: Analysing Large Data Sets using Formal Concept Lattices

In-Close: Concept Reduction

• Using FcaBedrock's context reduction:
  – Attributes of no particular interest can be excluded from the analysis (attribute exclusion).
  – We can convert only those objects with specific attribute values (object exclusion).
• Introducing In-Close's concept reduction:
  – Using the well-known idea of minimum support
    • Specifying a minimum number of objects and/or attributes for a concept.

⇨ Reduction of 'noise' in a context.

Page 15: Analysing Large Data Sets using Formal Concept Lattices

In-Close: Concept Reduction

• 'Noise': Concepts containing fewer attributes or objects than the user-defined minimums.
• Reduction of 'noise' achieved by:
  – Semi-automated form of lattice 'iceberging'.
    • Complete hierarchy maintained in the lattice.
  – Mining a context for concepts that satisfy a minimum support and then re-writing the context using only those concepts.
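One plausible reading of "re-writing the context using only those concepts" is sketched below: keep the concepts that meet both minimums, then rebuild the context from the object/attribute pairs those concepts cover. This is an interpretation for illustration, not In-Close's actual implementation; the function name and the toy concepts are assumptions.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Concept = Tuple[FrozenSet[str], FrozenSet[str]]  # (extent, intent)


def reduce_context(concepts: List[Concept], min_extent: int,
                   min_intent: int) -> Tuple[List[Concept], Dict[str, Set[str]]]:
    """Filter concepts by minimum support and rebuild a quieter context."""
    kept = [(extent, intent) for extent, intent in concepts
            if len(extent) >= min_extent and len(intent) >= min_intent]
    reduced: Dict[str, Set[str]] = {}
    for extent, intent in kept:
        for obj in extent:
            # An object keeps an attribute only if a surviving concept covers the pair.
            reduced.setdefault(obj, set()).update(intent)
    return kept, reduced


# Hypothetical concepts of a two-object context.
concepts = [
    (frozenset({"o1", "o2"}), frozenset({"b"})),
    (frozenset({"o1"}), frozenset({"a", "b"})),
    (frozenset({"o2"}), frozenset({"b", "c"})),
    (frozenset(), frozenset({"a", "b", "c"})),
]
kept, reduced = reduce_context(concepts, min_extent=2, min_intent=1)
```

Because the reduced context is an ordinary context, it can be written back out and opened in a lattice viewer like any other.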

Page 16: Analysing Large Data Sets using Formal Concept Lattices

A Student Survey Example

• Student survey data
  – Demographic and 'problem' data from 587 university undergraduates.
  – Yes/No responses to 36 problems that a student may have experienced during their studies:
    • missing lectures, low performance, etc.
• Noisy data set:
  – 145 Formal Attributes
  – Processed by In-Close: 22,760,243 concepts!

Page 17: Analysing Large Data Sets using Formal Concept Lattices

A Student Survey Example

• Let us say we are only interested in analysing the 'problem' data.
• Using FcaBedrock:
  – Convert only these attributes, exclude demographics.
  – Remaining concepts: 339,672
    • Significant reduction, but still too many!
• Adding In-Close to the equation:
  – Set minimum size of intent to 4 and minimum size of extent to 80.

⇨ Remaining concepts: 32!

Page 18: Analysing Large Data Sets using Formal Concept Lattices

Visualisation of the sub-context in ConExp

Page 19: Analysing Large Data Sets using Formal Concept Lattices

Comparing Quiet Sub-Contexts

• Data Set: Agaricus-Lepiota (aka Mushroom)
• Using FcaBedrock:
  – Create two sub-contexts: one for edible mushrooms and one for poisonous mushrooms.
• Using In-Close (for each sub-context):
  – Set minimum size of intent to 10.

⇨ 2,848 objects + 17 concepts for the edible sub-context, 3,344 objects + 14 concepts for the poisonous sub-context.

Page 20: Analysing Large Data Sets using Formal Concept Lattices

Comparing Quiet Sub-Contexts

• Similarities between the two lattices:
  – Attributes expressed in both lattices were moved to the right of each lattice.
• Differences between the two lattices:
  – Attributes expressed in only one lattice were moved to the left.

⇨ Clear visualisation for comparison.
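The split into shared and lattice-specific attributes can be computed before arranging the diagrams by hand. The sketch below assumes each lattice is available as a list of (extent, intent) pairs and treats an attribute as "expressed" if it appears in at least one intent; the function names and the toy attribute names are hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Concept = Tuple[FrozenSet[str], FrozenSet[str]]  # (extent, intent)


def expressed_attributes(concepts: Iterable[Concept]) -> Set[str]:
    """Collect every attribute that appears in at least one intent."""
    attrs: Set[str] = set()
    for _, intent in concepts:
        attrs |= intent
    return attrs


def compare(concepts_a: Iterable[Concept],
            concepts_b: Iterable[Concept]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (shared, only in A, only in B) attribute sets."""
    a = expressed_attributes(concepts_a)
    b = expressed_attributes(concepts_b)
    return a & b, a - b, b - a


# Hypothetical one-concept lattices standing in for the two sub-contexts.
edible = [(frozenset({"m1"}), frozenset({"odor-none", "ring-one"}))]
poisonous = [(frozenset({"m2"}), frozenset({"odor-foul", "ring-one"}))]
shared, only_edible, only_poisonous = compare(edible, poisonous)
```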

Page 21: Analysing Large Data Sets using Formal Concept Lattices

Edible mushroom lattice in ConExp

Page 22: Analysing Large Data Sets using Formal Concept Lattices

Poisonous mushroom lattice in ConExp

Page 23: Analysing Large Data Sets using Formal Concept Lattices

Conclusion

• Large data sets may be difficult to deal with computationally, but:
  – It is the number of formal concepts derived from a data set that is the key factor in determining if a concept lattice will be useful as a visualisation.
• Readable lattices can be produced with a straightforward process of:
  – creating sub-contexts
  – reducing noise
• Freely available software.
• Burmeister (.cxt) format used to successfully interoperate between the three FCA tools.

Page 24: Analysing Large Data Sets using Formal Concept Lattices

Thank you very much.

Questions?