

Master thesis subjects

Topic 1: Policy Analytics: Comparing and Analyzing Insights from e-Participation Platforms and Social Media

Promotor Anthony Simonofski

Summary The impact of citizens on the decisions taken by political representatives is generally labeled “citizen participation” and is not new. This participation can be further stimulated through the use of Information and Communication Technologies (ICT), making it more accessible and more cost-efficient. The use of ICT to support participation is labeled “e-participation”. Two of the most popular channels are social media (Facebook, Twitter, etc.) and dedicated e-participation platforms (e.g., LeuvenMaaktHetMee). However, the ideas, comments, and discussions of citizens on these two channels generate a lot of data that political representatives must process afterwards. The goal of this master’s thesis is to: • Identify citizens’ requirements regarding the themes they would like to discuss on both channels; • Investigate and apply relevant data analysis techniques to make sense of the data on both channels for policy-makers; • Compare the themes discussed on social media and on e-participation platforms in a specific case (contacts available in Leuven, Mons or Liège).
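As a minimal, purely illustrative sketch (all posts and stop words below are invented), comparing the themes raised on the two channels could start from simple keyword counts before moving to proper topic modeling:

```python
from collections import Counter

# Hypothetical sample posts from the two channels (illustrative data only).
social_media_posts = [
    "more bike lanes please",
    "bike parking near the station",
    "the park needs better lighting",
]
platform_posts = [
    "proposal: extend bike lanes on the ring road",
    "budget for park maintenance",
    "park benches are broken",
]

STOPWORDS = {"the", "a", "on", "for", "are", "near", "needs", "please", "more"}

def theme_counts(posts):
    """Count content words as a crude proxy for discussion themes."""
    words = (w for post in posts for w in post.lower().split())
    return Counter(w for w in words if w not in STOPWORDS)

social_themes = theme_counts(social_media_posts)
platform_themes = theme_counts(platform_posts)

# Themes raised on both channels vs. themes unique to one channel.
shared = set(social_themes) & set(platform_themes)
only_social = set(social_themes) - set(platform_themes)
```

In a real thesis, the `theme_counts` step would be replaced by an actual topic-modeling or text-classification technique, but the comparison structure (shared vs. channel-specific themes) stays the same.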

Explanation /

Reading list • Cronemberger, F., & Gil-Garcia, J. (2020). Problem Conceptualization as a Foundation of Data Analytics in Local Governments: Lessons from the City of Syracuse, New York. In Proceedings of the 53rd Hawaii International Conference on System Sciences. • Belkahla Driss, O., Mellouli, S., & Trabelsi, Z. (2019). From citizens to government policy-makers: Social media data analysis. Government Information Quarterly, 36(3), 560–570. • Lago, N., Durieux, M., Pouleur, J.-A., Scoubeau, C., Elsen, C., & Schelings, C. (2019). Citizen Participation through Digital Platforms: the Challenging Question of Data Processing for Cities. SMART 2019: The Eighth International Conference on Smart Cities, Systems, Devices and Technologies, (August), 19–25.

Prerequisites None


Topic 2: Clustering association rules for data quality improvement

Promotor Bart Baesens

Summary The aim of this project is to research the applicability of association rule mining in the context of data quality (data cleaning). Data cleaning techniques aim to find dependencies that hold for approximately all of the data. Association rule mining can be used to find such patterns in the data, but it tends to produce a large list of candidate rules, which makes it hard for domain experts to validate them. The goal of this thesis is to study how clustering techniques can be used to reduce this set of candidate rules to a more manageable size.
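A toy sketch of the idea, using invented records: mine single-attribute association rules as approximate dependencies, then group ("cluster") the resulting rules so a domain expert reviews coherent bundles instead of a flat list. A real study would use a proper miner (e.g. Apriori) and an actual clustering technique over rule-similarity.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; a rule like city=Leuven -> zip=3000 holding for nearly all
# rows suggests a (possibly dirty) dependency usable for data cleaning.
rows = [
    {"city": "Leuven", "zip": "3000"},
    {"city": "Leuven", "zip": "3000"},
    {"city": "Leuven", "zip": "3001"},   # potential data-quality violation
    {"city": "Mons", "zip": "7000"},
    {"city": "Mons", "zip": "7000"},
]

def mine_rules(rows, min_conf=0.5):
    """Mine one-attribute association rules (A=a -> B=b) with confidence."""
    support = defaultdict(int)
    pair_support = defaultdict(int)
    for row in rows:
        items = sorted(row.items())
        for item in items:
            support[item] += 1
        for x, y in combinations(items, 2):
            pair_support[(x, y)] += 1
            pair_support[(y, x)] += 1
    rules = {}
    for (ante, cons), cnt in pair_support.items():
        conf = cnt / support[ante]
        if conf >= min_conf:
            rules[(ante, cons)] = conf
    return rules

rules = mine_rules(rows)

# Group rules by antecedent attribute so an expert validates one coherent
# bundle at a time instead of scanning the whole rule list.
clusters = defaultdict(list)
for (ante, cons), conf in rules.items():
    clusters[ante[0]].append(((ante, cons), conf))
```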

Explanation /

Reading list /

Prerequisites /


Topic 3: Explaining anomalies found by Isolation Forest

Promotor Bart Baesens

Summary Isolation Forest is a highly performant algorithm for finding outliers in a data set and can be used in many problem domains (fraud detection, adversarial attacks, …). The results of most outlier detection algorithms, however, are hard to interpret for end users or business stakeholders. The goal of this thesis is to research and compare several strategies for making the predictions of an Isolation Forest more interpretable for domain experts. This can be done by approximating the algorithm's output with a more interpretable model, by using post-hoc techniques from the domain of explainable AI (Shapley values, LIME, RuleFit, …), or by leveraging the internal structure of an isolation forest. You can apply the techniques to the Http dataset (or another dataset of your choice).
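One of the suggested directions, leveraging the internal structure of the forest, can be sketched with a deliberately tiny, pure-Python isolation tree. All parameters and data are illustrative; the real algorithm (Liu et al., 2008) and the Http dataset are not reproduced here.

```python
import random

random.seed(0)

def isolation_path(point, rows, depth=0, max_depth=8, used=None):
    """Randomly split until `point` is isolated; record the features used.
    Returns (path_length, {feature: times split on}) for one random tree."""
    if used is None:
        used = {}
    if len(rows) <= 1 or depth >= max_depth:
        return depth, used
    f = random.randrange(len(point))                 # pick a random feature
    lo, hi = min(r[f] for r in rows), max(r[f] for r in rows)
    if lo == hi:                                     # cannot split further
        return depth, used
    split = random.uniform(lo, hi)
    used[f] = used.get(f, 0) + 1
    same_side = [r for r in rows if (r[f] < split) == (point[f] < split)]
    return isolation_path(point, same_side, depth + 1, max_depth, used)

def score_and_explain(point, rows, n_trees=100):
    """Average path length (shorter = more anomalous), plus a crude
    explanation: how often each feature was split on while isolating
    the point, hinting at which dimensions drive the anomaly."""
    total, feats = 0, {}
    for _ in range(n_trees):
        d, used = isolation_path(point, rows)
        total += d
        for f, c in used.items():
            feats[f] = feats.get(f, 0) + c
    return total / n_trees, feats

# Inliers cluster around (0, 0); the outlier deviates on the first feature.
rows = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
outlier = (10.0, 0.0)

outlier_path, outlier_feats = score_and_explain(outlier, rows + [outlier])
inlier_path, _ = score_and_explain(rows[0], rows + [outlier])
```

The outlier is isolated in far fewer splits than an inlier, and the per-feature split counts are one (very rough) internal-structure explanation that the thesis could refine or compare against Shapley values or LIME.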

Explanation /

Reading list F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation Forest,” in 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, Dec. 2008, pp. 413–422, doi: 10.1109/ICDM.2008.17. C. Molnar, Interpretable Machine Learning.

Prerequisites /


Topic 4: Leveraging Social Media Data for Bankruptcy Prediction Models

Promotor Bart Baesens

Summary Extract social media data from, e.g., Twitter, Facebook or Instagram, and distill relevant information for predicting a company's financial health (e.g., bankruptcy, revenue growth). Develop a model that incorporates financial ratios as well as the extracted social media data.
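A hypothetical sketch of the feature-construction step (the word lists, ratios and posts are all made up; a real study would use a trained sentiment model and actual financial data):

```python
# Hypothetical sentiment lexicons; real work would use a trained model.
POSITIVE = {"growth", "profit", "record", "expanding"}
NEGATIVE = {"layoffs", "losses", "bankruptcy", "debt"}

def sentiment_score(posts):
    """Crude lexicon-based sentiment: (pos - neg) / total words."""
    pos = neg = total = 0
    for post in posts:
        for word in post.lower().split():
            total += 1
            pos += word in POSITIVE
            neg += word in NEGATIVE
    return (pos - neg) / total if total else 0.0

def feature_vector(financial_ratios, posts):
    """Combine classical ratios with a social-media-derived feature,
    ready to feed into a bankruptcy prediction model."""
    return financial_ratios + [sentiment_score(posts)]

ratios = [0.8, 1.5, 0.2]   # e.g. debt ratio, current ratio, ROA (made up)
posts = ["record profit this quarter", "expanding into new markets"]
x = feature_vector(ratios, posts)
```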

Explanation /

Reading list Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1–8. https://doi.org/10.1016/j.jocs.2010.12.007 Jung, S. H., & Jeong, Y. J. (2020). Twitter data analytical methodology development for prediction of start-up firms’ social media marketing level. Technology in Society, 63, 101409. https://doi.org/10.1016/j.techsoc.2020.101409 Liang, D., Lu, C. C., Tsai, C. F., & Shih, G. A. (2016). Financial ratios and corporate governance indicators in bankruptcy prediction: A comprehensive study. European Journal of Operational Research, 252(2), 561–572. https://doi.org/10.1016/j.ejor.2016.01.012 Sun, A., Lachanski, M., & Fabozzi, F. J. (2016). Trade the tweet: Social media text mining and sparse matrix factorization for stock market prediction. Intern

Prerequisites Predictive modelling in Python / R


Topic 5: ProfLogit for Credit Scoring

Promotor Bart Baesens

Summary The idea is to compare ProfLogit with other machine learning techniques for credit default modeling.

Explanation ProfLogit is a recently developed extension of logistic regression that directly optimizes profit instead of a maximum likelihood function. In this dissertation, you will compare ProfLogit with a selection of state-of-the-art machine learning techniques (e.g., logistic regression, XGBoost, random forests) on a selection of real-life credit scoring data sets. You will use various performance criteria and a robust statistical evaluation framework for performance comparison.
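The spirit of profit-driven evaluation can be illustrated with a stylized example (all probabilities, profits and losses below are invented; ProfLogit itself and EMP-style measures are considerably more refined):

```python
def total_profit(probs, labels, threshold, profit_good=10.0, loss_default=100.0):
    """Profit of accepting applicants whose predicted default probability
    is below the threshold: an accepted good customer earns profit_good,
    an accepted defaulter costs loss_default (stylized numbers)."""
    profit = 0.0
    for p, y in zip(probs, labels):
        if p < threshold:                 # accept the applicant
            profit += -loss_default if y == 1 else profit_good
    return profit

# Hypothetical scored applicants: (predicted default probability, actual default)
probs  = [0.05, 0.10, 0.20, 0.40, 0.70, 0.90]
labels = [0,    0,    0,    1,    1,    1]

# Profit-oriented evaluation can pick a different cutoff than accuracy would.
best = max((total_profit(probs, labels, t), t) for t in [0.15, 0.3, 0.5, 0.8])
```

Here the profit-maximizing cutoff accepts exactly the three non-defaulters; a model trained (like ProfLogit) or evaluated on profit rather than likelihood directly targets this kind of objective.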

Reading list STRIPLING E., VANDEN BROUCKE S., ANTONIO K., BAESENS B., SNOECK M., Profit Maximizing Logistic Model for Customer Churn Prediction Using Genetic Algorithms, Swarm and Evolutionary Computation, forthcoming, 2017. VERBRAKEN T., VERBEKE W. BAESENS B., A Novel Profit Maximizing Metric for Measuring Classification Performance of Customer Churn Prediction Models, IEEE Transactions on Knowledge and Data Engineering, Volume 25, Issue 5, pp. 961-973, 2013.

Prerequisites Programming experience (in e.g., Java, Python or R)


Topic 6: Learning to rank for cost-sensitive credit card fraud detection

Promotor Bart Baesens

Summary Credit card fraud poses a significant problem for banks. Given a huge number of credit card transactions and limited resources, banks have to prioritize which transactions to check for fraud in order to minimize their losses. This thesis will tackle credit card fraud detection as a learning-to-rank problem. A novel method will be developed and tested empirically.

Explanation The current state of the art in credit card fraud detection is predictive modelling: models that predict the probability of an instance being fraudulent. In reality, however, the bank’s employees are time-constrained and have to prioritize transactions. Therefore, the goal of this thesis is to reformulate the fraud detection task as a learning-to-rank problem. Although this approach is typically used in an information retrieval context, it can also be applied to prioritize potentially fraudulent instances. The ranking should take into account both the probability of fraud and the expected cost. Using this model, the bank's priorities can be set more appropriately to minimize fraud losses. In this thesis, the student will formulate fraud detection as a cost-sensitive learning-to-rank problem. The proposed method will then be empirically compared with existing methods on real data.
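The simplest cost-sensitive baseline such a method would be compared against can be sketched as a point-wise ranking by expected loss (all transactions and probabilities below are hypothetical):

```python
# Hypothetical transactions: (id, model's fraud probability, amount in EUR)
transactions = [
    ("t1", 0.90, 10.0),     # very likely fraud, but tiny amount at risk
    ("t2", 0.30, 5000.0),   # less likely, but a huge potential loss
    ("t3", 0.05, 200.0),
    ("t4", 0.60, 800.0),
]

def expected_loss(txn):
    """Cost-sensitive ranking key: fraud probability times amount at risk."""
    _, p_fraud, amount = txn
    return p_fraud * amount

# Investigators with capacity for k checks work down this list.
ranked = sorted(transactions, key=expected_loss, reverse=True)

# A purely probability-based ranking puts the cheap transaction first.
by_probability = sorted(transactions, key=lambda t: t[1], reverse=True)
```

A learning-to-rank formulation would learn the ordering directly instead of composing it from a separately estimated probability, which is exactly the gap this thesis explores.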

Reading list Li, H. (2011). A short introduction to learning to rank. IEICE TRANSACTIONS on Information and Systems, 94(10), 1854-1862.

Prerequisites Knowledge of Python, or a willingness to learn. Familiarity with machine learning techniques is a plus.


Topic 7: Data augmentation for tabular data: an empirical evaluation

Promotor Bart Baesens

Summary Data is crucial to train machine learning models, but is often expensive to acquire. Data augmentation refers to a collection of techniques that slightly transform the available data to create new synthetic data. This can be leveraged to improve performance of machine learning models. In this thesis, the student will give an overview of existing techniques for data augmentation that apply to tabular data. These will then be benchmarked empirically against each other. Furthermore, the student is encouraged to come up with new ideas, or translate ideas from a different domain (e.g. computer vision) to tabular data.

Explanation Modern machine learning methods require large amounts of data to perform well. In reality, gathering more data is often expensive or even impossible. In computer vision, much work has been done on data augmentation techniques, which create new, synthetic data by (slightly) modifying existing data. However, most real-world applications rely on tabular data, for which data augmentation is less straightforward because the structure of the data is less intuitive for humans to understand. Nevertheless, specific techniques exist (e.g. SMOTE) or are applicable to tabular data (e.g. adding noise). In this thesis, the student will analyze the existing techniques and categorize them in a literature review. Next, these should be compared in an empirical evaluation. The student is encouraged to come up with and implement new ideas.
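Two of the techniques mentioned, SMOTE-style interpolation and noise injection, can be sketched in a few lines (simplified: real SMOTE interpolates between nearest neighbours within the minority class, not arbitrary pairs):

```python
import random

random.seed(42)

def smote_like(rows, n_new):
    """Create synthetic rows by linear interpolation between random
    pairs of existing rows (a simplified, neighbour-free SMOTE variant)."""
    synthetic = []
    for _ in range(n_new):
        a, b = random.sample(rows, 2)
        lam = random.random()
        synthetic.append(tuple(x + lam * (y - x) for x, y in zip(a, b)))
    return synthetic

def add_noise(rows, sigma=0.05):
    """Augment by perturbing each numeric feature with Gaussian noise."""
    return [tuple(x + random.gauss(0, sigma) for x in row) for row in rows]

original = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
augmented = original + smote_like(original, 5) + add_noise(original)
```

The empirical evaluation would then train the same model on `original` vs. `augmented` data and compare generalization performance across augmentation strategies.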

Reading list - Yoon, J., Zhang, Y., Jordon, J., & van der Schaar, M. (2020). VIME: Extending the Success of Self-and Semi-supervised Learning to Tabular Domain. Advances in Neural Information Processing Systems, 33. - Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 1-48.

Prerequisites Knowledge of Python, or a willingness to learn. Familiarity with machine learning techniques is a plus.


Topic 8: Development of event abstraction visualisation techniques for process mining on IoT sensor data

Promotor Estefanía Serral Asensio

Summary Event abstraction techniques make it possible to extract more complex information (events of the process) from low-level data (e.g. sensor data) in process mining. However, it is often difficult to understand the links between low-level data and higher-level information. In this thesis, students will explore ways to visualise these links to help improve the understanding of the process.

Explanation Like other mining techniques, process mining relies on having suitable data, i.e. an event log. However, data available in traditional DBs is not always readily usable for process mining, and new data sources such as IoT devices (sensors) are often not at the right abstraction level. For this reason, event abstraction techniques, such as complex event processing or ontology-based data access, are increasingly used to extract meaningful business information from low-level data (e.g. based on the temperature in a fridge, you can deduce whether the goods stored in it are still fresh). The links that event abstraction creates between the raw data sources and the event log (and the process model that can be extracted from the event log) are typically difficult to understand. However, such an understanding would help the business understand the process better and find improvement possibilities. The goal of this thesis is to explore ways to visualise the links between low-level data, events in the event log, and the resulting process model.
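The fridge example can be sketched as a small rule-based abstraction step that also keeps the provenance links the thesis wants to visualise (the threshold and minimum duration are assumed business rules, and the readings are invented):

```python
# Low-level sensor readings: (timestamp, fridge temperature in °C)
readings = [(0, 4.0), (1, 4.2), (2, 9.5), (3, 10.1), (4, 4.1)]

THRESHOLD, MIN_DURATION = 8.0, 2   # assumed business rule

def abstract_events(readings):
    """Derive a higher-level event ("goods possibly spoiled") from raw
    readings, keeping links back to the low-level data that caused it."""
    events, streak = [], []

    def close_streak():
        if len(streak) >= MIN_DURATION:
            events.append({"event": "goods_possibly_spoiled",
                           "start": streak[0][0], "end": streak[-1][0],
                           "source_readings": list(streak)})  # provenance

    for ts, temp in readings:
        if temp > THRESHOLD:
            streak.append((ts, temp))
        else:
            close_streak()
            streak = []
    close_streak()
    return events

events = abstract_events(readings)
```

The `source_readings` field is exactly the low-level-to-event link that is usually lost; making such links visible (e.g. overlaying sensor traces on the mined process model) is the visualisation challenge of the thesis.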

Reading list van Zelst SJ, Mannhardt F, de Leoni M, Koschmider A (2020) Event abstraction in process mining: literature review and taxonomy. Granul Comput. https://doi.org/10.1007/s41066-020-00226-2 Calvanese, Diego, et al. "OBDA for log extraction in process mining." Reasoning Web International Summer School. Springer, Cham, 2017 Diba, Kiarash, et al. "Extraction, correlation, and abstraction of event data for process mining." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10.3 (2020): e1346

Prerequisites Programming skills, knowledge of process modelling


Topic 9: Concurrency in Java: Project Loom

Promotor Ferdi Put

Summary Keywords: concurrency, multithreading, native threads, green threads, coroutines, Java

Explanation Concurrency is one of the central themes in the course Systeemsoftware [D0I63a]. The historical evolution from single-threaded applications (without concurrency), over coroutines, to green threads (user threads) managed and scheduled in user space by the runtime system, and finally native threads (kernel threads) scheduled by the OS kernel, was analyzed in depth in this course. From its origin, Java has always supported multithreading. Support for green threads was soon abandoned in favor of the more advanced (but also more resource-consuming) native threads. The recent Project Loom, however, reintroduces fibers (the older green threads) and continuations (coroutines) in Java. In this way, it should become possible to introduce an actor-based approach. Concurrency should become simpler to program, use fewer resources, and perform better. The purpose of this master project is to evaluate the possibilities and performance of this new “old” approach.

Reading list J. Bacon & T. Harris, Operating systems – concurrent and distributed software design, Addison-Wesley, 2003 OpenJDK, Loom – fibers, continuations and tail-calls for the JVM, https://openjdk.java.net/projects/loom/

Prerequisites Systeemsoftware [D0I63a]


Topic 10: Session-based routing

Promotor Ferdi Put

Summary Session-oriented routing makes packet transmission simpler and more transparent, while offering improved security, control and agility.

Explanation A session is a two-way exchange of information and comprises related flows in both directions. Today, almost every network involves bi-directional sessions to move packets, and nearly all advanced service functions require an understanding of and control over network sessions. A session-oriented router is claimed to make packet transmission fundamentally simpler and more transparent, while offering improved security, control and agility. Sessions enable these benefits because the software is intelligent enough to dynamically optimize how and where packets travel through the network. Session management has traditionally been done higher up the open systems interconnection (OSI) stack by the communicating endpoints, which are not aware of all the other sessions on the network. Layer 3 session awareness enables the router to dynamically manage all sessions going across a network in an intelligent way and provide end-to-end visibility. The purpose of this master project is to investigate how session-based routing deviates from the existing Internet routing model, its relationship with software-defined networking (SDN), and the feasibility of session-based routing.
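The core notion, grouping both directions of an exchange into one session, can be sketched by normalizing the 5-tuple so that a packet and its reply map to the same key (the packets below are hypothetical):

```python
from collections import defaultdict

def session_key(pkt):
    """Normalize a packet's 5-tuple so that both directions of an
    exchange map onto the same session key."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    return (pkt["proto"],) + (a + b if a < b else b + a)

packets = [  # hypothetical client<->server exchange plus an unrelated flow
    {"proto": "TCP", "src_ip": "10.0.0.1", "src_port": 43210,
     "dst_ip": "93.184.216.34", "dst_port": 443},
    {"proto": "TCP", "src_ip": "93.184.216.34", "src_port": 443,
     "dst_ip": "10.0.0.1", "dst_port": 43210},   # the reply direction
    {"proto": "UDP", "src_ip": "10.0.0.1", "src_port": 5353,
     "dst_ip": "224.0.0.251", "dst_port": 5353},
]

sessions = defaultdict(list)
for pkt in packets:
    sessions[session_key(pkt)].append(pkt)
```

A session-oriented router maintains state per such bidirectional key (rather than per packet or per unidirectional flow), which is what lets it apply policy and optimization to the whole exchange.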

Reading list TechTarget, Session-based routing, January 2019, https://searchnetworking.techtarget.com/definition/session-based-routing 128 Technology: Why session-based routers will fix the Internet, IT Business Edge, July 14, 2016 P. MeLampy, Session-based routing holds the key to the Internet’s future, Network World, November 18, 2016, https://www.networkworld.com/article/3142643/session-based-routing-holds-the-key-to-the-internets-future.html Peterson & Davie, Computer Networks – a systems approach, Release Version 6.1

Prerequisites Computer networks


Topic 11: Managing recursively virtualized networks

Promotor Ferdi Put

Summary Keywords: network management, network virtualization, software defined networking (SDN)

Explanation For almost as long as there have been packet-switched networks, there have been ideas about how to virtualize them, starting with virtual circuits. VPNs were one early success for virtual networking. They allowed carriers to present corporate customers with the illusion that they had their own private network. VLANs are how we typically virtualize an L2 network. VXLAN encapsulates a virtual Ethernet frame inside a UDP packet. The powerful thing about virtualization is that, when done right, it should be possible to nest one virtualized resource inside another virtualized resource. VXLAN, for example, makes it possible to have multiple VLANs encapsulated in a VXLAN overlay, which in turn is encapsulated in a VLAN. The hard part is grappling with the idea of virtual networks being nested (encapsulated) inside virtual networks. The other challenge is understanding how to automate the creation, management, migration, and deletion of virtual networks, and on this front there is still a lot of room for improvement. Mastering this challenge will be at the heart of networking in the next decade. (Text fragments taken from Peterson & Davie, pp. 175-177.)

Reading list Peterson & Davie, Computer Networks – a systems approach, Release Version 6.1

Prerequisites Computer networks


Topic 12: Open Compute Project (OCP): white-box switches

Promotor Ferdi Put

Summary Keywords: network equipment, white-box switches, switch design

Explanation Recent advances in domain-specific processors and other commodity components (SRAM, TCAM, SFP, SFI, ...) have made it possible for anyone to build a high-performance switch by pulling the blueprint off the web and running some open-source L2 and L3 stacks available on GitHub on this home-built switch. These switches are called “open white-box” switches, in contrast to the closed “black-box” devices that have historically dominated the industry. The beauty of this new switch design is that a given white-box can be programmed to be an L2 switch, an L3 router, or a combination of both, simply as a matter of programming. Internally, the white-box switch uses a domain-specific network processing unit (NPU) with an architecture and instruction set optimized for processing packet headers. The NPU takes advantage of fast SRAM-based memory buffers, TCAM-based table lookup, and a forwarding pipeline implemented by an ASIC. A multi-stage pipeline allows concurrent processing of multiple packets. Finally, other commodity components make this all practical: small form-factor pluggable transceivers (SFP+) connected over a standardized bus (SFI). (Text fragments taken from Peterson & Davie, pp. 172-174.)

Reading list Peterson & Davie, Computer Networks – a systems approach, Release Version 6.1

Prerequisites Computer networks


Topic 13: Cryptography in Java

Promotor Ferdi Put

Summary Keywords: cryptography, Java

Explanation -

Reading list Peterson & Davie, Computer Networks – a systems approach, Release Version 6.1, pp. 412-414 Oracle, Java Cryptography Architecture (JCA) Reference Guide, https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html M. Sheth, Encryption and decryption in Java cryptography, https://www.veracode.com/blog/research/encryption-and-decryption-java-cryptography#tldr

Prerequisites Computer networks


Topic 14: DNS-based security

Promotor Ferdi Put

Summary Keywords: DNS, network security

Explanation Does malware use DNS to gain command and control, exfiltrate data, or redirect web traffic? When internet requests are resolved by a recursive DNS service, they become the perfect place to check for and block malicious or inappropriate domains and IP addresses. An important opportunity is missed if DNS is not monitored for indications of compromise. Proactive DNS-layer security should be a core component of any security strategy. At the same time, it can be investigated whether the recently developed DNS security extensions (DNSSEC) are important for such a DNS-based security strategy.

Reading list L. Peterson & B. Davie, Computer Networks – a systems approach, Release Version 6.1, pp. 449-457 Cisco Umbrella, Core - DNS - Cybersecurity for remote workers, Cisco, 2020, https://media.bitpipe.com/io_15x/io_150726/item_2104445/Cybersecurity%20for%20Remote%20Workers%20How%20to%20Secure%20Every%20Device%20Everywhere.pdf M. Elias, DNS-Based Security – Who Are You Kidding?, 18 December 2018, https://www.allot.com/blog/risks_dns_based_security/ R. Arends, R. Austein, M. Larson, D. Massey & S. Rose, DNS Security Introduction and Requirements, IETF RFC 4033, March 2005

Prerequisites Computer networks


Topic 15: Software Defined Perimeter and Zero Trust Security model

Promotor Ferdi Put

Summary Keywords: network security, VPN, SDP, zero-trust model

Explanation The traditional enterprise network is separated from the outside world by a fixed perimeter that consists of a series of firewall functions that block external users from coming in, but allows internal users to get out. The weaknesses of this traditional fixed perimeter model are becoming problematic when considering developments like user-managed devices, phishing attacks, SaaS, IoT, ... Software Defined Perimeter (SDP) addresses this issue by providing the ability to deploy perimeters anywhere – on the internet, in the cloud, on the corporate network, or across all of these locations. Instead of assuming everything behind the corporate firewall is safe, the Zero Trust model assumes breach and verifies each request as though it originates from an open network. Regardless of where the request originates or what resource it accesses, Zero Trust teaches us to “never trust, always verify.” Every access request is fully authenticated, authorized, and encrypted before granting access. Micro-segmentation and least privileged access principles are applied to minimize lateral movement. Rich intelligence and analytics are utilized to detect and respond to anomalies in real time.

Reading list Peterson & Davie, Computer Networks – a systems approach, Release Version 6.1, Chapter 8 Microsoft, Enable a remote workforce by embracing Zero Trust security, https://www.microsoft.com/en-us/security/business/zero-trust M. Goss, SDP vs. VPN vs. zero-trust networks: What's the difference? https://searchnetworking.techtarget.com/feature/SDP-vs-VPN-vs-zero-trust-networks-Whats-the-difference? Wikipedia, Software Defined Perimeter, https://en.wikipedia.org/wiki/Software_Defined_Perimeter Pulse Secure, Demystifying Zero Trust Network Access (ZTNA), Pulse Secure, 2019

Prerequisites Computer networks


Topic 16: Blockchain-based self-sovereign identity: compliance with OAuth 2.0

Promotor Ferdi Put

Summary Keywords: identity management, authentication, authorization, blockchain, OAuth 2.0

Explanation Self-sovereign identity (SSI) is based on blockchain technology but so far lacks wide public use because of its low compatibility and inconvenience. The purpose of this master project is to investigate the compliance of an SSI model with the popular and mature OAuth 2.0 standard. (Text based on Seongho Hong & Heeyoul Kim.)

Reading list Peterson & Davie, Computer Networks – a systems approach, Release Version 6.1, pp. 412-414 A. Preukschat & D. Reed, Self-Sovereign Identity - decentralized digital identity and verifiable credentials, Manning, MEAP version 7, 2020 J. Richer & A. Sanso, OAuth2 in action, Manning, 2017 A. Mühle, A. Grüner, et al, A survey on essential components of a self-sovereign identity, Computer Science Review, 30, November 2018, pp. 80-86 Seongho Hong & Heeyoul Kim, VaultPoint: A blockchain-based SSI model that complies with OAuth 2.0, Electronics, 2020, 9, 1231, https://www.mdpi.com/2079-9292/9/8/1231/pdf

Prerequisites Computer networks


Topic 17: Investigating the rise of CDO roles in companies – BOARDEX database, analytics of text-based dataset

Promotor Jan Vanthienen

Summary Digital transformation, the transformation happening in society and business due to the increased impact of digital technologies, requires strong leadership. Many companies give this responsibility to a new role, the chief digital officer (CDO). In this thesis, you will investigate the rise of CDO roles in companies by using the BOARDEX database. In this database, you can find CDO appointments, their backgrounds, connections and so forth. The goal is to find out which types of companies introduce CDO roles, why, and who fills them. You can link this information with other databases or company announcements. More information about the BOARDEX database can be found here: https://bib.kuleuven.be/ebib/collectie/data/databanken/wrds

Explanation /

Reading list /

Prerequisites none


Topic 18: Digital transformation in services - interviews

Promotor Jan Vanthienen

Summary In this thesis, you will conduct semi-structured interviews with companies that are currently working on the digital transformation of their services (advice, customer service, insurance, …). The goal is to sketch the general transformation happening and to derive guidelines on how companies can start digitalizing their services, and potentially some sort of maturity levels of digital service transformation. You have to propose some interesting companies or industries yourself, after reading the literature.

Explanation /

Reading list /

Prerequisites none


Topic 19: Test case generation for decision models: DMN based conformance checking

Promotor Jan Vanthienen

Summary Creating test cases is of primordial importance when testing which behaviour is allowed in an environment. In this thesis you will have to find a way to generate a wide variety of test scenarios that a DMN model allows.

Explanation Conformance checking is a widely studied topic, as it checks whether the observed behaviour is also allowed by the model. The problem is that one can only check behaviour that is actually observed in the process log. In order to test exhaustively which scenarios the model allows, it is therefore important to generate test scenarios that the model allows. In this thesis, you will automatically generate test scenarios for the decision modelling language DMN.
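A minimal sketch of the idea, using an invented in-memory decision table rather than a parsed DMN file: enumerate every combination of input values the model allows and pair it with the decision the table produces, yielding an exhaustive set of test cases.

```python
from itertools import product

# A tiny DMN-style decision table (hypothetical loan example):
# each rule maps input conditions to an output decision.
inputs = {
    "age": ["<18", "18-65", ">65"],
    "income": ["low", "high"],
}
rules = {
    ("18-65", "high"): "approve",
    ("18-65", "low"): "review",
}
DEFAULT = "reject"   # assumed default-output hit policy

def decide(age, income):
    return rules.get((age, income), DEFAULT)

# Exhaustive test scenarios: every allowed input combination, paired
# with the decision the table produces for it.
test_cases = [((age, inc), decide(age, inc))
              for age, inc in product(inputs["age"], inputs["income"])]
```

A real generator would parse the DMN XML, respect hit policies, and handle numeric ranges symbolically instead of enumerating literal values, but the generated artifact (input combination plus expected decision) is the same.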

Reading list • Boonmepipit, B., & Suwannasart, T. (2019). Test case generation from BPMN with DMN. ACM International Conference Proceeding Series, 92–96. https://doi.org/10.1145/3374549.3374582

Prerequisites • Knowledge of decision models • Programming knowledge or willing to learn


Topic 20: Comparing DMN models created by a large number of people across several challenges

Promotor Jan Vanthienen

Summary A survey project: given a set of modelling guidelines, participants model a decision scenario into a decision model using the DMN standard. Human modellers take varied approaches to modelling, so it is worthwhile to examine and compare the differences in the models they create.

Explanation An online survey methodology is to be designed. A questionnaire needs to be formulated to capture the modellers' intentions regarding the understandability, consistency and completeness of the models they created. You will perform exploratory data analysis on the data gathered and, furthermore, interview practitioners to validate the findings.

Reading list Mendling, J., Reijers, H., & Cardoso, J. (2007). What Makes Process Models Understandable? BPM.

Prerequisites • Survey methodology • Willingness to work during the summer


Topic 21: A user-friendly conversational agent to improve decision support for online businesses

Promotor Jan Vanthienen

Summary Conversational agents are a current focus of research in the areas of human-computer interaction and decision support systems. A new line of research lies in improving decision support for end users by integrating seamless voice/text interaction with knowledge-based decision execution frameworks such as the IDP system.

Explanation Build an interactive chatbot that aids decision making through voice- or text-based interactions with users. Propose a novel methodology to integrate existing NLP techniques with decision execution engines. The bot should be able to gather the required inputs, deal with missing inputs, and execute decision models on the IDP system.
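
A minimal slot-filling sketch of the gather-inputs/deal-with-missing-inputs loop; the slot names and the stand-in decision function are assumptions for illustration (a real agent would parse free text with NLP and hand the complete inputs to a decision executor such as the IDP system):

```python
# Required inputs the bot must collect before a decision can be executed.
REQUIRED_SLOTS = ["age", "income"]

def next_question(slots):
    """Return the question for the next missing input, or None if complete."""
    for name in REQUIRED_SLOTS:
        if name not in slots:
            return f"What is your {name}?"
    return None

def execute_decision(slots):
    """Stand-in for handing complete inputs to a decision execution engine."""
    return "accept" if slots["age"] >= 18 and slots["income"] >= 2000 else "reject"
```

The dialogue loop simply alternates between `next_question` and recording the user's answer until the slot set is complete, then calls `execute_decision`.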

Reading list • Alman, A., Balder, K.J., Maggi, F.M., & Aa, H.V. (2020). Declo: A Chatbot for User-friendly Specification of Declarative Process Models. BPM. • Aa, H.V., Balder, K.J., Maggi, F.M., & Nolte, A. (2020). Say It in Your Own Words: Defining Declarative Process Models Using Speech Recognition. BPM. • Amato, F., Marrone, S., Moscato, V., Piantadosi, G., Picariello, A., & Sansone, C. (2017). Chatbots Meet eHealth: Automatizing Healthcare. WAIAH@AI*IA.

Prerequisites • Knowledge of Decision Modeling • Good programming knowledge


Topic 22: Identifying performance measures of DMN models

Promotor Jan Vanthienen

Summary Finding, evaluating and applying performance measures for DMN models.

Explanation A common evaluation method for process models is to run cases through the model and see whether the model allows that behaviour. This is formalized in, for example, precision and recall metrics. Currently, these metrics do not exist for DMN models. In this thesis you will identify which evaluation metrics can be used to measure the performance of a decision model and how they would apply to decision models.
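
As a starting point, a replay-style fitness measure for decision models can be sketched as follows; the toy decision function, the case format and the metric itself are illustrative assumptions, not established DMN metrics:

```python
def replay_fitness(decide, cases):
    """Fraction of logged cases whose recorded outcome the model reproduces:
    a decision-level analogue of process-model fitness (sketch only)."""
    hits = sum(1 for c in cases if decide(c["inputs"]) == c["outcome"])
    return hits / len(cases)

# Toy decision model and logged cases (illustrative values).
decide = lambda x: "accept" if x["score"] >= 600 else "reject"
cases = [
    {"inputs": {"score": 700}, "outcome": "accept"},
    {"inputs": {"score": 550}, "outcome": "reject"},
    {"inputs": {"score": 610}, "outcome": "reject"},  # deviates from the model
]
fitness = replay_fitness(decide, cases)  # 2 of 3 cases replayed correctly
```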

Reading list Thomas Molka, David Redlich, Marc Drobek, Artur Caetano, Xiao-Jun Zeng, and Wasif Gilani. 2014. Conformance checking for BPMN-based process models. In Proceedings of the 29th Annual ACM Symposium on Applied Computing (SAC '14). Association for Computing Machinery, New York, NY, USA, 1406–1413. DOI:https://doi.org/10.1145/2554850.2555061

Prerequisites • Knowledge of Decision Modeling


Topic 23: Assisted DMN model creation from natural language descriptions

Promotor Jan Vanthienen

Summary Build a tool that automatically models a DMN model from highlighted natural language descriptions.

Explanation Even for experienced modellers, modelling is an intensive and time-consuming task. López et al. (2019) developed a tool that allows users to highlight relevant text for declarative process models and model it automatically. Not only DECLARE would benefit from such a tool; DMN would as well. In this thesis, you will build a similar tool that automatically creates DMN models from highlighted text.
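
The extraction step can be sketched as a toy pattern matcher over a highlighted sentence; the "if … then …" phrasing and the regular expression are illustrative assumptions, not the method of López et al.:

```python
import re

# Toy rule extractor: given a highlighted sentence of the form
# "if <condition> then <output>", recover the pieces a DMN rule needs.
RULE = re.compile(r"if (?P<condition>.+?) then (?P<output>.+)", re.IGNORECASE)

def extract_rule(sentence):
    """Return the condition/output pair of a highlighted sentence, or None."""
    match = RULE.search(sentence)
    if not match:
        return None
    return {"condition": match.group("condition").strip(),
            "output": match.group("output").strip()}

rule = extract_rule("If the customer is older than 18 then the loan is eligible.")
```

A real tool would replace the regex with NLP parsing and map the extracted pairs onto DMN decision table rows.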

Reading list H. A. López, M. Marquard, L. Muttenthaler and R. Strømsted, "Assisted Declarative Process Creation from Natural Language Descriptions," 2019 IEEE 23rd International Enterprise Distributed Object Computing Workshop (EDOCW), Paris, France, 2019, pp. 96-99, doi: 10.1109/EDOCW.2019.00027.

Prerequisites • Knowledge of process models • Good programming knowledge


Topic 24: Comparison and implementation of decision mining techniques on Event logs

Promotor Jan Vanthienen

Summary This thesis will require you to benchmark several decision mining techniques and draw conclusions.

Explanation Over the last few years, several decision mining techniques with differing performance have been proposed. These techniques aim to do for decisions what certain process mining techniques do for processes: identify them from event data. In this thesis, you will compare the performance of these decision mining techniques.

Reading list De Smedt, J., Hasić, F., vanden Broucke, S. K. L. M., & Vanthienen, J. (2019). Holistic discovery of decision models from process execution data. Knowledge-Based Systems, 183, 104866. https://doi.org/10.1016/j.knosys.2019.104866 Van Der Aalst, W., Weijters, T., & Maruster, L. (2004). Workflow mining: Discovering process models from event logs. IEEE Transactions on Knowledge and Data Engineering, 16(9), 1128–1142. https://doi.org/10.1109/TKDE.2004.47

Prerequisites • Good programming knowledge


Topic 25: Anti Money Laundering in Cryptocurrencies with Graph Neural Networks

Promotor Jochen De Weerdt

Summary As cryptocurrencies like Bitcoin, Ethereum and Litecoin become more popular and accepted across the globe, the challenges of securing coins and transactions also increase. In particular, decentralized currencies are becoming popular among criminals because of their anonymity, complicating efforts by law enforcement agencies to track down individual transactions and link them to users. According to Europol, Bitcoin was used in 40% of illicit transactions in the EU (Malik, 2018). Due to this popularity, many financial organizations would like to offer their clients the possibility to buy, hold and pay with cryptocurrencies. However, financial organizations are often under stringent supervision by government agencies that require compliance with legislation; in particular, banks should prevent money laundering. Analyzing cryptocurrency transactions for money laundering requires new tools and automated solutions. The purpose of this master's thesis is to investigate the possibilities of machine learning for anti-money laundering. You will start from an existing study which compares traditional machine learning with state-of-the-art graph neural networks; in that study, graph neural networks did not outperform traditional machine learning. Liu et al. describe inconsistencies faced by graph neural networks when dealing with fraud data, which could explain the weak performance in the study by Weber et al. The goal of this thesis is to apply a modified GCN model that overcomes these inconsistencies and compare its performance with a plain vanilla GCN and traditional machine learning models. For this thesis you will work with an open-source Bitcoin dataset.
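
For reference, a single "plain vanilla" GCN propagation step (in the spirit of the baseline mentioned above) can be sketched in NumPy; the toy adjacency matrix, features and weights are made up for illustration:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One plain GCN propagation step (Kipf & Welling style):
    H' = ReLU(D^-1/2 (A+I) D^-1/2 H W). Sketch only, not the thesis model."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # degrees of the self-looped graph
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric normalisation
    return np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Toy transaction graph: 3 nodes, 2 input features, 2 hidden units.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.ones((2, 2)) * 0.5
H_next = gcn_layer(A, H, W)
```

Stacking such layers lets each node aggregate information from a growing neighbourhood, which is what the modified models in the thesis would build on.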

Explanation /

Reading list - Weber, Mark, Giacomo Domeniconi, Jie Chen, Daniel Karl I. Weidele, Claudio Bellei, Tom Robinson, and Charles E. Leiserson. 2019. "Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics." arXiv [cs.SI]. http://arxiv.org/abs/1908.02591. - Liu, Zhiwei, Yingtong Dou, Philip S. Yu, Yutong Deng, and Hao Peng. 2020. "Alleviating the Inconsistency Problem of Applying Graph Neural Network to Fraud Detection." arXiv [cs.SI]. http://arxiv.org/abs/2005.00625. - Malik, Nikita. 2018. "How Criminals And Terrorists Use Cryptocurrency: And How To Stop It." Forbes Magazine, August. https://www.forbes.com/sites/nikitamalik/2018/08/31/how-criminals-and-terrorists-use-cryptocurrency-and-how-to-stop-it/?sh=41e00d693990.

Prerequisites - Analytical mindset - Not afraid of working with real data and doing some coding (Python) - Willing to learn about machine learning, neural networks, graph neural networks and cryptocurrencies


Topic 26: Graph Neural Networks for Fraud Detection: a Literature Study and Benchmarking

Promotor Jochen De Weerdt

Summary Neural networks (NN) are a class of machine learning algorithms that have proven very successful across various tasks, from self-driving cars to face recognition. Deep learning refers to neural networks with multiple layers; because of this layered architecture, the network can progressively extract higher-level features, which eventually helps to solve the task at hand (e.g. recognizing a face). More recently, neural networks have been applied to graphs or network data. Network data, which consists of nodes connected by edges, is particularly hard to use in machine learning. A neural network can help to extract features from the network which can be used for the downstream machine learning task. The original NN algorithms have been adapted to work on graphs and are collectively known as graph neural networks. An interesting and challenging use case for graph neural networks is financial fraud detection/prediction. Credit card holders and shop owners are connected by the transactions made between them; hence, this network of transactions can be analyzed for suspicious behavior that might indicate fraudulent use of credit cards. The goal of this thesis is to explore the field of graph neural networks with a particular focus on financial fraud and anomaly detection. The main deliverable is an extensive literature review complemented by a comparative analysis of the most promising techniques on a real-life fraud dataset.

Explanation /

Reading list - http://snap.stanford.edu/proj/embeddings-www/ - Hamilton, William L. 2020. "Graph Representation Learning." Synthesis Lectures on Artificial Intelligence and Machine Learning 14 (3): 1–159. - https://github.com/safe-graph/graph-fraud-detection-papers

Prerequisites - The thesis might require some coding (in Python). However, you will primarily use code that is publicly available (e.g. on GitHub); the code you will have to write will be limited, and this skill can be learned during the thesis. - Willing to learn about machine learning, neural networks and graph neural networks. - English writing skills and command of academic English, or willingness to learn.


Topic 27: Catching Camouflaged Criminals with Node Representation Learning

Promotor Jochen De Weerdt

Summary The shift from cash to plastic payments has been fueled by the rise of e-commerce and, more recently, by the global COVID pandemic. Despite increased safety and protection measures, credit card payments are still an important source of fraudulent activity. Catching fraudsters before they commit fraud is becoming increasingly difficult as fraudsters adapt to new security measures. In addition, fraudsters know how to hide without raising suspicion: they mimic the spending behavior of their victims, making the fraud very hard to detect. In this thesis you will work on a real-life dataset in which you will have to track down criminals. To reveal these criminals, you will rely on node representation learning techniques (e.g. DeepWalk). These techniques start from a graph/network representation of all transactions. For each node of the network, a low-dimensional, continuous vector representation (called an embedding) is learned which captures the relational characteristics of the graph (i.e. neighbouring nodes will have similar vector representations). The goal of this thesis is to experiment with shallow node representation learning techniques to improve their performance in credit card fraud detection. The issue with traditional node embedding is that there are too many genuine transactions relative to the number of fraudulent transactions. Hence, can we change the network structure in such a way that more importance is given to the fraudulent transactions? Some ideas:
- Use extreme edge weights
- Add artificial nodes (e.g. a 'fraud' node which is connected to all known frauds)
- Add time-based weighting of edges
- Leave out all licit transactions and build a network with illicit transactions only
- Perform more random walks for fraudulent nodes and fewer for licit nodes
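
The idea of performing more random walks for fraudulent nodes can be sketched as a biased DeepWalk-style corpus generator; the toy graph, the walk counts and the `fraud_factor` parameter are illustrative assumptions:

```python
import random

def random_walk(adj, start, length, rng):
    """One uniform random walk of the given length over an adjacency dict."""
    walk = [start]
    for _ in range(length - 1):
        neighbours = adj[walk[-1]]
        if not neighbours:
            break
        walk.append(rng.choice(neighbours))
    return walk

def biased_walk_corpus(adj, fraud_nodes, walks_per_node=2, fraud_factor=5,
                       length=4, seed=0):
    """DeepWalk-style walk corpus in which fraud nodes get fraud_factor
    times more walks, oversampling the minority class (sketch only)."""
    rng = random.Random(seed)
    corpus = []
    for node in adj:
        n = walks_per_node * (fraud_factor if node in fraud_nodes else 1)
        for _ in range(n):
            corpus.append(random_walk(adj, node, length, rng))
    return corpus

# Toy transaction graph with one known fraudulent node 'c'.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
corpus = biased_walk_corpus(adj, fraud_nodes={"c"})
```

The resulting walk corpus would then be fed to a skip-gram model (as in DeepWalk) so fraud-adjacent context appears more often during embedding training.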

Explanation /

Reading list - Pourhabibi, Tahereh, Kok-Leong Ong, Booi H. Kam, and Yee Ling Boo. 2020. "Fraud Detection: A Systematic Literature Review of Graph-Based Anomaly Detection Approaches." Decision Support Systems, April, 113303. - Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. 2014. "DeepWalk: Online Learning of Social Representations." In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 701–10. KDD '14. New York, NY, USA: Association for Computing Machinery. - "Card Fraud Losses Reach $27.85 Billion." n.d. Accessed September 17, 2020. https://nilsonreport.com/mention/407/1link/.

Prerequisites - Analytical mindset - Not afraid of working with real data and coding (Python) - Willing to learn about machine learning, credit card fraud and graph/node representation learning


Topic 28: Trace encoding and representation learning in process mining

Promotor Jochen De Weerdt

Summary In this challenging thesis, the students will experiment with different ways of obtaining encodings (vector representations or embeddings) for process executions.

Explanation Process mining tries to extract knowledge from business process event logs. To do this, we need a way of representing traces (process executions). Encoding puts the data in a format that further processing can use, often by converting it into a feature space (typically a vector). In this master's thesis, the students will test different ways of encoding traces. They will use "classical" methods such as alignments as well as methods borrowed from data mining and natural language processing, such as Word2Vec. A starting point is the following paper: https://www.researchgate.net/publication/348357644_Evaluating_Trace_Encoding_Methods_in_Process_Mining. Ideally, the experiments will be supplemented by expanding the Word2Vec encodings to include multiple n-grams and/or other encoding methods from deep learning (e.g. encoder-decoder RNNs). The students will also have to devise a way to test the quality of the different embeddings on multiple tasks.
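
The simplest "classical" baseline, encoding each trace as a vector of activity frequencies, can be sketched as follows (the toy log is made up for illustration):

```python
def frequency_encode(traces):
    """Encode each trace (list of activity labels) as a vector of activity
    counts over the log's alphabet: the simplest baseline encoding to
    compare against Word2Vec-style embeddings (sketch only)."""
    alphabet = sorted({act for trace in traces for act in trace})
    index = {act: i for i, act in enumerate(alphabet)}
    vectors = []
    for trace in traces:
        vec = [0] * len(alphabet)
        for act in trace:
            vec[index[act]] += 1
        vectors.append(vec)
    return alphabet, vectors

# Toy event log: two traces over four activities.
log = [["register", "check", "pay"],
       ["register", "check", "check", "reject"]]
alphabet, vectors = frequency_encode(log)
```

Note that frequency vectors discard ordering, which is exactly the information sequence-aware encodings such as n-grams or encoder-decoder RNNs try to retain.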

Reading list • Evaluating Trace Encoding Methods in Process Mining: https://www.researchgate.net/publication/348357644_Evaluating_Trace_Encoding_Methods_in_Process_Mining • Act2vec, trace2vec, log2vec, and model2vec: Representation learning for business processes: https://link.springer.com/chapter/10.1007/978-3-319-98648-7_18 • Case2vec: Advances in Representation Learning for Business Processes: https://www.researchgate.net/publication/344430983_Case2vec_Advances_in_Representation_Learning_for_Business_Processes • Trace Alignment in Process Mining: Opportunities for Process Diagnostics: https://www.researchgate.net/publication/202481415_Trace_Alignment_in_Process_Mining_Opportunities_for_Process_Diagnostics

Prerequisites • Good programming skills (e.g. Python) or willingness to learn • Knowledge of data analysis and Machine Learning or willingness to learn.


Topic 29: Watch out for the big fish: Instance-Dependent Cost-Sensitive Classification for Fraud Detection

Promotor Jochen De Weerdt

Summary Fraud detection often involves an immediate financial loss for the bank or the client: for example, a fraudulent credit card transaction carries a specific financial loss depending on its amount. This thesis aims to compare specialized classification models that minimize the overall cost of misclassification.

Explanation Instance-dependent cost-sensitive learning (IDCSL) incorporates instance importance (e.g. financial loss) into the classification model. IDCSL differs from standard classification because the latter assumes that every misclassification is equally important, an assumption that might not lead to an optimal result from a business perspective. In fraud detection, misclassifying true fraudsters often implies a higher loss than mistaking true non-fraudsters: for example, transfer and insurance fraud have instance-dependent costs. This thesis aims to compare, through an experimental setup, IDCSL models and to measure their predictive performance by standard evaluation metrics and by financial losses. We can also compare with standard classification models to find out whether the cost-sensitive approach leads to better results. Depending on the students' background and preference, we can dig into a more advanced setup that includes the realistic scenario of hidden fraudsters: some normal instances are actually true fraudsters, but they were mislabeled by previous investigations. This scenario remains an unsolved challenge for most fraud detection systems.
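
The evaluation by financial losses can be sketched with a toy instance-dependent cost function; the cost matrix (the transaction amount lost per missed fraud, a fixed fee per false alarm) and all numbers are illustrative assumptions, not taken from the cited papers:

```python
def total_cost(y_true, y_pred, amounts, investigation_cost=10.0):
    """Instance-dependent misclassification cost: a missed fraud costs the
    transaction amount, a false alarm costs a fixed investigation fee."""
    cost = 0.0
    for truth, pred, amount in zip(y_true, y_pred, amounts):
        if truth == 1 and pred == 0:    # missed fraud -> lose the amount
            cost += amount
        elif truth == 0 and pred == 1:  # false alarm -> investigation fee
            cost += investigation_cost
    return cost

# Toy labels (1 = fraud) and transaction amounts.
y_true  = [1, 0, 1, 0]
amounts = [500.0, 80.0, 20.0, 60.0]
model_a = [1, 0, 0, 0]  # misses only the small fraud
model_b = [0, 1, 1, 0]  # misses the big fraud and raises one false alarm
```

Under this metric, model_a is far cheaper than model_b even though a cost-blind metric would treat both mistakes as equivalent errors.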

Reading list • General Fraud Detection Reference: Baesens, Bart, Vlasselaer, Véronique Van, & Verbeke, Wouter. (2015). Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques. Hoboken, NJ, USA: John Wiley & Sons. • Bahnsen, A. C., Aouada, D., & Ottersten, B. (2015). Example-dependent cost-sensitive decision trees. Expert Systems with Applications, 42(19), 6609-6619. • Höppner, S., Baesens, B., Verbeke, W., & Verdonck, T. (2020). Instance-Dependent Cost-Sensitive Learning for Detecting Transfer Fraud. arXiv preprint arXiv:2005.02488.

Prerequisites • Experience with Python or R. • Good knowledge for models for classification tasks (e.g. logistic regression, random forest).


Topic 30: Looking for a needle in a network: Anomaly Detection in Networks

Promotor Jochen De Weerdt

Summary Detecting anomalies is an essential task for several applications such as health care, finance, and fraud detection. In this thesis, we dig into anomaly detection for network data (i.e. graph). The thesis aims to compare state-of-the-art techniques in anomaly detection for graphs.

Explanation Network data makes it possible to represent complex interdependence between objects: for instance, a fraud ring represents a complex connection between fraudsters. Networks, or graphs, can also be found in other domains such as social media and health care. In a network, we can detect anomalies when a graph object deviates from the normal patterns within the structure. The thesis aims to compare techniques for anomaly detection in graphs. Through an empirical analysis, we aim to better understand the algorithms and their strengths and weaknesses. Depending on the students' background, we can consider state-of-the-art techniques from the deep learning literature.
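
As a baseline intuition, a toy structural anomaly score can flag nodes whose degree deviates strongly from the graph average; the techniques surveyed in the reading list are far richer, and the graph here is made up for illustration:

```python
import statistics

def degree_anomaly_scores(adj):
    """Score each node by how many standard deviations its degree lies
    from the mean degree (a toy structural baseline, sketch only)."""
    degrees = {node: len(neigh) for node, neigh in adj.items()}
    mean = statistics.mean(degrees.values())
    stdev = statistics.pstdev(degrees.values()) or 1.0  # guard zero spread
    return {node: abs(d - mean) / stdev for node, d in degrees.items()}

# Toy graph: 'hub' is connected to every other node.
adj = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"],
}
scores = degree_anomaly_scores(adj)
```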

Reading list • Akoglu, L., Tong, H., & Koutra, D. (2015). Graph based anomaly detection and description: a survey. Data mining and knowledge discovery, 29(3), 626-688. • Easley, D., & Kleinberg, J. (2010). Networks, crowds, and markets (Vol. 8). Cambridge: Cambridge university press.

Prerequisites • Experience with Python or R. • Willingness to learn machine learning in graphs.


Topic 31: Development of a data-driven decision-support architecture for Ekonomika

Promotor Jochen De Weerdt

Summary Over the past year, Ekonomika has invested heavily in its IT infrastructure. A new website was designed, with a correspondingly renewed backend in a Salesforce CRM system. This backend stores virtually all transactions from the site, such as students buying books or registering for events. Besides the transaction itself, we store a good deal of personal data about students, such as study programme, year of study or student-room address. In theory, these data should allow us to map student behaviour. A possible analysis is the evolution of the kinds of events a student attends over the course of their studies. For years, Ekonomika has assumed that younger students participate more in our party activities (TD, cantus, …) while older students focus more on our career offering (job fair, skill programmes, LCC, …). Until now we have never been able to substantiate this more quantitatively than "at the event, the crowd seemed young to us". We would very much like to investigate this in a more analytical way.

Explanation Besides these more strategic analyses, more operational analyses could also be carried out. These could be delivered to Ekonomika's board in an accessible way via dashboards, to support its day-to-day operations. Examples include supporting the course service in its inventory management, or event-registration analyses for the marketing team so they know which events need more promotion or which groups they should target more. Within Ekonomika we certainly have the interest in these projects, but we lack a great deal of knowledge. On the one hand, we do not yet have a concrete picture of everything that is possible; we already have some ideas, but with the necessary academic help, far more can probably be extracted than we first thought. On the other hand, there is the technical aspect: we do not know the correct ways to carry out these analyses. The ultimate goal of the thesis is therefore to support both the executive board and the board of directors of Ekonomika with operational and strategic analyses of our data. Another project within Ekonomika's ICT working group is optimizing the site itself, using tools such as Hotjar, Google Analytics and Google Tag Manager. These tools allow us to track how students use the site. We want to use these data to structure our site better, so that students can easily find what they are looking for. Here, however, we again run into the same wall: a lack of knowledge about how best to approach this project.

Reading list The topic for the literature study still needs to take shape, but it could go in the direction of applying data analytics in non-profits or SMEs. An analysis of analytics platforms could also be a possible theme.

Prerequisites • Preferably Dutch speaking


Topic 32: CRM Analytics at Ekonomika Alumni Association

Promotor Jochen De Weerdt

Summary Ekonomika Alumni Association: profiling members and activities. Ekonomika Alumni is the official alumni association of the Faculty of Economics and Business (FEB) of KU Leuven. It organizes activities for its more than 50,000 alumni and aims at strengthening the network of FEB graduates, both in terms of professional connections and leisure or cultural gatherings. The association organizes lectures, company visits and meetings with CEOs, but also leisure activities such as a BBQ, a cantus or an alumni trip. The alumni association has a database with information on its members (where they live and work, when they graduated, whether they are paying members, …) and their participation in activities organized by the association. In order to optimize its organization, Ekonomika Alumni wants to analyze the profile of its members and their participation in the alumni activities. The main objective is to explore the match between the characteristics of the members and the properties of the organized activities, in order to optimally adapt the range of activities to the characteristics of its members. Also, an international benchmark of alumni activities will be performed in order to position Ekonomika Alumni among alumni associations at other economics and business faculties.

Explanation /

Reading list Nam, Dalwoo, Junyeong Lee, and Heeseok Lee. "Business analytics use in CRM: A nomological net from IT competence to CRM performance." International Journal of Information Management 45 (2019): 233-245. Anshari, Muhammad, et al. "Customer relationship management and big data enabled: Personalization & customization of services." Applied Computing and Informatics 15.2 (2019): 94-101. Sharma, Sarika. "Big Data Analytics for Customer Relationship Management: A Systematic Review and Research Agenda." International Conference on Advances in Computing and Data Sciences. Springer, Singapore, 2020.

Prerequisites /


Topic 33: Identifying Bait-and-Click Listings

Promotor Johannes De Smedt

Summary Online reviews can be found everywhere. With the growth not just of e-shops but of platforms in general, a rising number of product listings gather reviews that are later reused to serve different, inferior products. This is called bait-and-click and poses a real problem in review fraud.

Explanation The goal of the dissertation is to collect bait-and-click listings and their reviews (e.g. from Amazon.com) and craft a methodology for identifying listings that suffer from this problem. In a second stage, a wider investigation into sellers can be performed. Text mining will likely yield an appropriate way to identify when product listings, their reviews and their scores do not match.
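
A first cut at the listing/review mismatch idea can be sketched with bag-of-words cosine similarity; the example texts and the heuristic itself are illustrative assumptions, not the dissertation's methodology:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def mismatch_score(listing_text, review_texts):
    """1 - similarity between the listing and its pooled reviews: a high
    score hints the reviews describe another product (toy heuristic)."""
    listing = Counter(listing_text.lower().split())
    reviews = Counter(" ".join(review_texts).lower().split())
    return 1.0 - cosine(listing, reviews)

suspicious = mismatch_score("usb cable charger",
                            ["great phone case", "lovely phone case colour"])
consistent = mismatch_score("usb cable charger",
                            ["solid usb cable", "charger works well"])
```

A real pipeline would swap the raw token counts for TF-IDF weights or embeddings and add review-score features, but the mismatch signal is the same.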

Reading list Rosso, Mark A., and Bernard J. Jansen. "Smart marketing or bait & switch: competitors' brands as keywords in online advertising." In Proceedings of the 4th workshop on Information credibility, pp. 27-34. 2010. https://arstechnica.com/tech-policy/2020/12/amazon-still-hasnt-fixed-its-problem-with-bait-and-switch-reviews/?utm_brand=ars&utm_social-type=owned&utm_source=facebook&utm_medium=social&fbclid=IwAR0RqODynpt-VD80vrXeX2cDls3q1gy0Ho3TYgTRuFuhItTYGv6KfFtVc70

Prerequisites Basic knowledge of Python, text mining


Topic 34: From web log to process data

Promotor Johannes De Smedt

Summary Web logs capture the activity of users during website visits. They contain information such as the pages viewed, their origin (other websites, ads, etc.), time on site, etc. They are particularly useful to get a grasp of what people are looking for before conversion (e.g. buying something). To perform funnel analysis and apply predictive techniques, however, it would be useful to approach them from a process mining perspective. This dissertation focuses on converting web logs into appropriate event logs using the correct aggregation.

Explanation The main challenge will be finding the correct level of aggregation. Web logs consist of web pages, which are more numerous than the activities in a typical business process. Furthermore, the timing aspect and the particular events that turn an action into an activity all need to be investigated. Techniques such as (word) embeddings, network embeddings and clustering can all be used to perform this analysis and aggregation.
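
One common first aggregation, splitting a visitor's page hits into cases at an inactivity gap, can be sketched as follows; the 30-minute cut-off and the row format are conventional assumptions, not prescribed by the topic:

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # conventional session cut-off (assumption)

def weblog_to_eventlog(rows):
    """Group raw page hits per visitor into cases, starting a new case
    when the gap between consecutive hits exceeds SESSION_GAP."""
    rows = sorted(rows, key=lambda r: (r["visitor"], r["timestamp"]))
    cases, case_id, last_seen = [], 0, {}
    for row in rows:
        prev = last_seen.get(row["visitor"])
        if prev is None or row["timestamp"] - prev > SESSION_GAP:
            case_id += 1  # new visitor or long gap -> new case
        last_seen[row["visitor"]] = row["timestamp"]
        cases.append({"case": case_id, "activity": row["page"],
                      "timestamp": row["timestamp"]})
    return cases

t = datetime(2021, 3, 1, 12, 0)
rows = [
    {"visitor": "v1", "page": "/home",     "timestamp": t},
    {"visitor": "v1", "page": "/product",  "timestamp": t + timedelta(minutes=5)},
    {"visitor": "v1", "page": "/checkout", "timestamp": t + timedelta(hours=2)},
]
events = weblog_to_eventlog(rows)
```

The resulting case/activity/timestamp triples are exactly the event-log shape process mining tools expect; the thesis would replace the fixed gap with learned aggregation.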

Reading list van Der Aalst, Wil. "Data science in action." In Process mining, Springer, Berlin, Heidelberg, 2016. Poggi, Nicolas, Vinod Muthusamy, David Carrera, and Rania Khalaf. "Business process mining from e-commerce web logs." In Business process management, pp. 65-80. Springer, Berlin, Heidelberg, 2013. Makanju, Adetokunbo AO, A. Nur Zincir-Heywood, and Evangelos E. Milios. "Clustering event logs using iterative partitioning." In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1255-1264. 2009. Perez-Castillo, Ricardo, Barbara Weber, Jakob Pinggera, Stefan Zugal, Ignacio Garcia-Rodriguez de Guzman, and Mario Piattini. "Generating event logs from non-process-aware systems enabling business process mining." Enterprise Information Systems 5, no. 3 (2011): 301-335.

Prerequisites /

Topic 35: A survey of predictive and prescriptive techniques in disaster management

Promotor Johannes De Smedt

Summary Business analytics is often considered along three pillars: descriptive, predictive (regression, forecasting, classification), and prescriptive (optimisation) analytics. Mostly, these pillars are researched separately, at least in the case of the latter two. Recently, however, efforts have been made to find synergies, e.g., to use optimisation in classification (e.g. neural networks) or in hybrid form (e.g. reinforcement learning). Lately, there have also been numerous works on using the output of predictive algorithms in prescriptive optimisation models. This dissertation focuses on performing a literature review of the efforts to use both types of analytics in the context of disaster management.

Explanation The dissertation should construct an overview of literature where optimisation techniques benefit from the input from machine learning-based approaches both in descriptive (e.g. embeddings/representation learning, clustering) and predictive analytics (e.g. forecasts of decision variables of optimisation models).
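A toy sketch of the predict-then-prescribe pipeline the reviewed literature studies: a naive moving-average forecast feeds a deliberately simplistic proportional allocation standing in for a real optimisation model. The regions, figures and allocation rule are all hypothetical:

```python
def forecast_demand(history, window=3):
    """Predictive step: naive moving-average forecast of next-period
    demand per region."""
    return {r: sum(h[-window:]) / min(window, len(h)) for r, h in history.items()}

def allocate(supply, forecasts):
    """Prescriptive step: allocate a limited relief supply
    proportionally to forecast demand (a stand-in for a genuine
    optimisation model with constraints)."""
    total = sum(forecasts.values())
    return {r: supply * d / total for r, d in forecasts.items()}

history = {"north": [80, 100, 120], "south": [40, 50, 60]}  # hypothetical data
forecasts = forecast_demand(history)
plan = allocate(supply=90, forecasts=forecasts)
print(forecasts)  # {'north': 100.0, 'south': 50.0}
print(plan)       # {'north': 60.0, 'south': 30.0}
```

The literature review would compare approaches that replace both steps with far richer models, but the two-stage structure (forecast decision-variable inputs, then optimise) is the common pattern.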

Reading list Bertsimas, Dimitris, and Nathan Kallus. "From predictive to prescriptive analytics." Management Science 66, no. 3 (2020): 1025-1044. Bertsimas, Dimitris, Leonard Boussioux, Ryan Cory Wright, Arthur Delarue, Vassilis Digalakis Jr, Alexandre Jacquillat, Driss Lahlou Kitane et al. "From predictions to prescriptions: A data-driven response to COVID-19." arXiv preprint arXiv:2006.16509 (2020). Holsapple, Clyde, Anita Lee-Post, and Ram Pakath. "A unified foundation for business analytics." Decision Support Systems 64 (2014): 130-141. Alem, Douglas, et al. "Building disaster preparedness and response capacity in humanitarian supply chains using the Social Vulnerability Index." European Journal of Operational Research (2020).

Prerequisites At least an intermediate knowledge of predictive and prescriptive techniques

Topic 36: Applying network analytics in a prescriptive context for disaster management

Promotor Johannes De Smedt

Summary Many optimisation models rely on representing the real world as closely as possible. However, they are typically limited in the number of variables that can be used to build models, as optimisation can quickly become computationally expensive. This dissertation looks at a particular problem in disaster management where network variables (e.g. road networks, supplier networks, etc.) are used in disaster relief support.

Explanation In order to obtain a scalable but strong representation of disaster relief networks, a multitude of techniques should be studied and/or implemented. Network embeddings can be used to reduce dimensionality, which in turn should improve computational performance without impacting optimisation performance. Simulation can be used in order to generate different scenarios.
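One simple way to compress a network into a handful of numeric features, as a stand-in for more advanced network embeddings, is a spectral embedding based on the eigenvectors of the graph Laplacian. The sketch below, on a hypothetical 4-node relief network, illustrates the dimensionality-reduction idea:

```python
import numpy as np

def spectral_embedding(adj, dim=2):
    """Embed network nodes using Laplacian eigenvectors: a simple way
    to compress a network into a few numeric features that could feed
    an optimisation model instead of the full adjacency structure."""
    deg = np.diag(adj.sum(axis=1))
    laplacian = deg - adj
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # ascending eigenvalues
    # skip the trivial constant eigenvector, keep the next `dim`
    return eigvecs[:, 1:dim + 1]

# Hypothetical 4-node road network: two depots each linked to two villages
adj = np.array([[0., 1., 1., 0.],
                [1., 0., 0., 1.],
                [1., 0., 0., 1.],
                [0., 1., 1., 0.]])
emb = spectral_embedding(adj, dim=2)
print(emb.shape)  # one 2-dimensional feature vector per node: (4, 2)
```

Methods such as node2vec or GraphSAGE would replace this spectral step in the actual dissertation; the point is only that each node ends up with a small, fixed-size vector.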

Reading list Alem, Douglas, et al. "Building disaster preparedness and response capacity in humanitarian supply chains using the Social Vulnerability Index." European Journal of Operational Research (2020). Noyan, Nilay, Burcu Balcik, and Semih Atakan. "A stochastic optimization model for designing last mile relief networks." Transportation Science 50.3 (2016): 1092-1113. Behl, Abhishek, and Pankaj Dutta. "Humanitarian supply chain management: a thematic literature review and future directions of research." Annals of Operations Research 283.1 (2019): 1001-1044.

Prerequisites Basic knowledge of Python, mathematical programming (e.g. CPLEX), analytics, optimisation(, simulation)

Topic 37: Overcoming bias in Belgian news outlets

Promotor Johannes De Smedt

Summary Aggregating different news outlets helps people get a holistic view of the news. Most people, however, stick with only one or a few outlets. For this dissertation, the aim is to check the possibility of applying matrix-based news aggregation in the Belgian (or Flemish) landscape.

Explanation Different news outlets bring the same story from different viewpoints to cater to their readership. By combining different viewpoints, a wisdom of the crowds surfaces that offers a less biased version. The goal of this dissertation is to pick one or more topics (e.g. Covid-19) and apply matrix-based news aggregation (see reading list) in the Belgian/Flemish context.
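The core layout of matrix-based news aggregation can be sketched as a topic-by-outlet matrix whose cells collect what each outlet published on a story, so viewpoints can be read side by side. The outlets, topics and headlines below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical article records: (outlet, topic, headline)
articles = [
    ("OutletA", "covid", "Vaccination campaign speeds up"),
    ("OutletB", "covid", "Concerns over vaccine side effects"),
    ("OutletA", "climate", "New emission targets announced"),
    ("OutletB", "covid", "Hospitals report falling admissions"),
]

def news_matrix(articles):
    """Arrange articles in a topic-by-outlet matrix so the same story
    can be compared across outlets (the basic layout behind
    matrix-based news aggregation)."""
    matrix = defaultdict(lambda: defaultdict(list))
    for outlet, topic, headline in articles:
        matrix[topic][outlet].append(headline)
    return {t: dict(row) for t, row in matrix.items()}

m = news_matrix(articles)
print(sorted(m["covid"]))  # ['OutletA', 'OutletB']
```

The real method in the reading list additionally groups articles into topics automatically and summarises each cell; here the topic labels are simply given.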

Reading list Hamborg, F., Meuschke, N., & Gipp, B. (2020). Bias-aware news analysis using matrix-based news aggregation. International Journal on Digital Libraries, 21(2), 129-147.

Prerequisites Basic knowledge of Python, text analysis

Topic 38: Preventive Prescriptive Process Monitoring

Promotor Johannes De Smedt

Summary Prescriptive Process Monitoring serves as an alarm-based intervention technique to guide the process in the correct way in order to achieve the best throughput time for a certain KPI. Additionally, this has been demonstrated to be more effective than calculating the next most likely events (Poustcchi et al., 2020). This dissertation focuses on preventing process owners from taking harmful actions by indicating the actions that will lead to the worst throughput time.

Explanation Prescriptive Process Monitoring serves as an alarm-based intervention technique to guide the process in the correct way in order to achieve the best throughput time for a certain KPI. However, such recommendations may not always be followed in practice. This dissertation aims to do the exact opposite: prevent the process owner from taking the worst possible actions, i.e. those that will inevitably lead to the worst throughput time(s).
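A minimal sketch of this "preventive" idea: score each candidate next action with a predictive model of remaining time (here stubbed with a lookup table) and raise an alarm for actions above a threshold. The action names, predicted values and threshold are hypothetical:

```python
def flag_actions(candidate_actions, predict_remaining_time, threshold):
    """Score each candidate next action with a predictive model and
    flag those whose predicted remaining time exceeds a threshold,
    warning the process owner away from the worst choices."""
    flagged = {}
    for action in candidate_actions:
        predicted = predict_remaining_time(action)
        flagged[action] = predicted > threshold
    return flagged

# Hypothetical stand-in for a trained remaining-time model (days)
lookup = {"escalate": 12.0, "approve": 3.0, "request_info": 9.5}
warnings = flag_actions(lookup, lookup.get, threshold=8.0)
print(warnings)  # escalate and request_info flagged, approve not
```

In the actual dissertation the lookup would be replaced by a trained remaining-time predictor, and the threshold would be tuned against intervention cost as in the alarm-based literature.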

Reading list Fahrenkrog-Petersen, S. A., Tax, N., Teinemaa, I., Dumas, M., De Leoni, M., Maggi, F. M., & Weidlich, M. (2019). Fire now, fire later: Alarm-based systems for prescriptive process monitoring. ArXiv. Poustcchi, K., Krasnova, H., Weinzierl, S., Zilker, S., Stierle, M., Matzner, M., & Park, G. (2020). From predictive to prescriptive process monitoring: Recommending the next best actions instead of calculating the next most likely events. WI2020 Zentrale Tracks, December 2019, 364–368. https://doi.org/10.30844/wi_2020_c12-weinzierl Teinemaa, I., Tax, N., de Leoni, M., Dumas, M., & Maggi, F. M. (2018). Alarm-Based Prescriptive Process Monitoring. In M. Weske, M. Montali, I. Weber, & J. vom Brocke (Eds.), Business Process Management Forum (pp. 91–107). Springer International Publishing.

Prerequisites At least an intermediate knowledge of predictive and prescriptive techniques

Topic 39: Efficiently mitigate bias in Classification with the XGBoost model

Promotor Johannes De Smedt

Summary Compare FairXGBoost with XGBoost combined with pre- or post-processing bias mitigation algorithms, evaluating both fairness and predictive performance.

Explanation FairXGBoost is a newly proposed algorithm that uses an in-processing bias mitigation technique to ensure fair classification. However, several pre- and post-processing bias mitigation algorithms have already been developed that are compatible with XGBoost. This dissertation consists of two major parts: converting the FairXGBoost concept, as described in the attached paper, into a Python implementation, and comparing it with pre- and/or post-processing techniques combined with an XGBoost model without fairness adjustments.
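To make the in-processing idea concrete, the sketch below shows a custom (gradient, hessian) objective in the shape XGBoost accepts for user-defined losses: the usual logistic-loss terms plus a penalty that pushes group-mean predictions together. This is a plausible simplified variant for illustration, not necessarily the exact loss from the FairXGBoost paper, and the data is invented:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fair_logistic_objective(preds, labels, sensitive, mu=1.0):
    """Custom (gradient, hessian) pair in the form expected by
    XGBoost's user-defined objectives: logistic loss plus a penalty
    on the covariance between predicted scores and the sensitive
    attribute (a simplified in-processing fairness regulariser)."""
    p = sigmoid(preds)
    grad = p - labels            # logistic-loss gradient
    hess = p * (1.0 - p)         # logistic-loss hessian
    # fairness penalty mu/n * sum(s_centered * p): its derivative
    # w.r.t. the raw score is mu/n * s_centered * p * (1 - p)
    s = sensitive - sensitive.mean()
    grad = grad + mu * s * hess / len(preds)
    return grad, hess            # penalty's hessian term omitted for simplicity

preds = np.array([0.2, -0.1, 0.5, -0.4])
labels = np.array([1.0, 0.0, 1.0, 0.0])
sensitive = np.array([1.0, 1.0, 0.0, 0.0])
g, h = fair_logistic_objective(preds, labels, sensitive)
print(g.shape, h.shape)  # (4,) (4,)
```

A function with this signature can then be passed to XGBoost as a custom objective, which is how the dissertation's Python implementation could plug the fairness term into standard boosting.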

Reading list Bellamy, R. K. E., Dey, K., Hind, M., Hoffman, S. C., Houde, S., Kannan, K., Lohia, P., Martino, J., Mehta, S., Mojsilovic, A., Nagar, S., Natesan, K., John, R., Diptikalyan, R., & Prasanna, S. (2018). AI Fairness 360: an extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. ArXiv E-Prints, abs/1810.0, 20. http://arxiv.org/abs/1810.01943 Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2019). A Survey on Bias and Fairness in Machine Learning. http://arxiv.org/abs/1908.09635 Ravichandran, S., Venkatesh, B., Khurana, D., & Edakunni, N. U. (2020). FairXGBoost: Fairness-aware classification in XGBoost. ArXiv.

Prerequisites Medium to advanced knowledge of Python, medium knowledge of statistical modelling

Topic 40: Evaluating modelling notations and modelling tools from a usability perspective

Promotor Monique Snoeck

Summary The Physics of Notations is a framework to evaluate visual notations. Its principles can be applied to improve the understandability of a modelling language. The Unified Modelling Language (UML) is a frequently used modelling language in model-driven engineering. MERODE is a modelling approach for domain modelling that uses both the UML notation and a notation of its own. The goal of this thesis is to make a comparative analysis of the notations as used in MERLIN and a UML modelling tool using the Physics of Notations theory.

Explanation Modelling languages are often defined on a semantic level, whereas the visual aspects of the languages are specified to a lesser extent. Therefore, the visual aspects of a modelling language often depend on the modelling tool you use. The Physics of Notations is a framework to evaluate visual notations, but it can also be used to evaluate tools. Its principles can be applied to improve the understandability of a modelling language. The Unified Modelling Language (UML) is a frequently used modelling language in model-driven engineering. MERODE is a modelling approach for domain modelling. It is complemented with a web application for modelling (MERLIN) and a code generator (CodeGen). The goal of this thesis is to make a comparative analysis of the notations used in MERLIN and a UML modelling tool using the Physics of Notations theory. First, a literature review will provide an overview of the existing evaluations of UML and UML tools. Then, students will make an analysis of MERLIN and a chosen UML tool. In a second phase, students may also consider usability theory to evaluate the modelling tools from the perspective of user interface design. They will suggest a set of possible future improvements for both tools. If time permits, they can develop an improvement for one of the tools and validate it.

Reading list • Moody, D. (2009). The physics of notations: Toward a scientific basis for constructing visual notations in software engineering. IEEE Transactions on Software Engineering, 35(6). https://doi.org/10.1109/TSE.2009.67 • van der Linden, D., & Hadar, I. (2019). A Systematic Literature Review of Applications of the Physics of Notations. IEEE Transactions on Software Engineering, 45(8), 736–759. https://doi.org/10.1109/TSE.2018.2802910

Prerequisites Knowledge of UML and MERODE

Topic 41: Assessing the impact of formative assessments and adaptive learning paths on learners' success.

Promotor Monique Snoeck

Summary The goal of the master thesis is to investigate learner behaviour and whether different learner behaviour types lead to more consistent success/study results.

Explanation Active learning has proven to have a positive effect on achieving learning goals. Active learning can be stimulated in many different ways, including flipped classrooms, formative assessments and using adaptive learning paths. Teachers of the BIS-course at FEB (D0H27A and D0T12A) have experimented with different approaches to stimulate students' active learning during the semesters as opposed to postponing all learning activities to the exam preparation period. Data has been collected about students' use of formative assessments, open exercises with adaptive release of feedback and a MOOC.

The goal of the master thesis is to investigate learner behaviour and whether different learner behaviour types lead to more consistent success/study results. Specifically, students will address the following research questions:

- To what extent are scores obtained for formative assessments and attempts at open exercises predictive of learner success?
- To what extent is the timing of participation in active learning activities predictive of learner success?

Besides results from the grade center, log data from Toledo and the UML MOOC can also be investigated. If possible, data from a previous run of the course will be sought to serve as a baseline.
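As a first, very simple operationalisation of the first research question, one could correlate formative-assessment scores with final exam results before moving to proper predictive models. The per-student numbers below are invented:

```python
def pearson(xs, ys):
    """Plain Pearson correlation: a first check of how predictive
    formative-assessment scores are for the final exam result."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-student data: formative score vs. exam score (on 20)
formative = [4, 7, 5, 9, 6]
exam = [8, 13, 9, 17, 12]
print(round(pearson(formative, exam), 3))  # ≈ 0.992 for this toy data
```

A high correlation alone would not answer the research question (timing and attempt patterns matter too), but it shows the kind of baseline analysis the grade-center data supports.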

Reading list - Siemens G. Learning Analytics: The Emergence of a Discipline. American Behavioral Scientist. 2013;57(10):1380-1400. doi:10.1177/0002764213498851

- Jelena Jovanović, Dragan Gašević, Shane Dawson, Abelardo Pardo, Negin Mirriahi, Learning analytics to unveil learning strategies in a flipped classroom, Internet and Higher Education 33 (2017) 74–85.

- Fatima Harrak, François Bouchet, Vanda Luengo, Pierre Gillois, Profiling students from their questions in a blended learning environment, LAK '18: Proceedings of the 8th International Conference on Learning Analytics and Knowledge, March 2018, Pages 102–110, https://doi.org/10.1145/3170358.3170389

Prerequisites /

Topic 42: Detecting patterns of Students' self-regulation strategies across a collection of courses.

Promotor Monique Snoeck

Summary The goal of this thesis is to investigate students' self-regulation strategies in dealing with a collection of courses simultaneously, e.g. whether mandatory task assignments for one course cause students to lag behind on other courses that do not have mandatory assignments.

Explanation Active learning is known to have a positive impact on a student's learning outcome. Teachers thus try to stimulate students' active learning behaviour by means of e.g. class preparation assignments, writing assignments and tests. However, unless activities contribute to the final grade, students retain relative freedom in how they organize and regulate their learning process. This becomes particularly relevant when considering the behaviour of students across different courses followed simultaneously. Given that in the current online settings a significant part of student activities can be monitored through the learning management system (LMS), an analytic approach to the behaviour of students may shed light on this question. In particular, clustering techniques have been successfully applied to detect learner profiles based on the way students interact with and use the learning environments.

The goal of this thesis is to investigate students' self-regulation strategies in dealing with a collection of courses simultaneously, e.g. whether mandatory task assignments for one course cause students to lag behind on other courses that do not have mandatory assignments.
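The clustering step mentioned above can be sketched with a minimal k-means on per-student activity counts. The feature choice (logins, submissions) and the data are hypothetical; a real analysis would use richer LMS features and an established library:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means to cluster students by LMS activity counts
    (e.g. (logins, assignments submitted)) into behaviour profiles."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each student to the nearest centroid
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical (logins, submissions) per student: steady workers vs. crammers
students = [(30, 9), (28, 8), (32, 10), (5, 1), (6, 2), (4, 1)]
centroids, clusters = kmeans(students, k=2)
print(sorted(len(c) for c in clusters))  # the two profiles: [3, 3]
```

With data from several simultaneous courses, the feature vector would simply gain one block of counts per course, and the resulting profiles could reveal e.g. students neglecting courses without mandatory assignments.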

Reading list - Siemens G. Learning Analytics: The Emergence of a Discipline. American Behavioral Scientist. 2013;57(10):1380-1400. doi:10.1177/0002764213498851

- Jelena Jovanović, Dragan Gašević, Shane Dawson, Abelardo Pardo, Negin Mirriahi, Learning analytics to unveil learning strategies in a flipped classroom, Internet and Higher Education 33 (2017) 74–85.

- Fatima Harrak, François Bouchet, Vanda Luengo, Pierre Gillois, Profiling students from their questions in a blended learning environment, LAK '18: Proceedings of the 8th International Conference on Learning Analytics and Knowledge, March 2018 Pages 102–110, https://doi.org/10.1145/3170358.3170389

Prerequisites /

Topic 43: Automated testing of applications

Promotor Monique Snoeck

Summary In order to test a software application thoroughly, many, many test scenarios need to be defined and executed, making it a time-consuming (and somewhat boring) process. When the software application is then adapted according to the defects that were found, all the testing needs to be performed again. The goal of this thesis is to explore test automation tools and evaluate different approaches for supporting the task of requirements-based testing. A good candidate could be capture and replay tools. One (open source/free) tool should be selected and tested, possibly with the help of students.

Explanation Testing is a time-consuming process. In order to test a software application thoroughly, many, many test scenarios need to be defined and executed. When the software application is then adapted according to the defects that were found, all the testing needs to be performed again. Several open source testing tools are out there to help testers ease their job, see e.g. https://java-source.net/open-source/testing-tools. A specific category are capture and replay tools, which allow capturing actions on applications and replaying them later on (like macros).

The goal of the thesis is to investigate existing tools that can help business and functional analysts perform requirements-based testing. A paper from 2013 comparing open source capture and replay tools concluded that Jacareto is the best to use (https://java-source.net/open-source/testing-tools/jacareto), one of its strong points (amongst others) being that it doesn't require programming knowledge. Students should explore the market of test automation tools and evaluate different approaches for supporting the task of requirements-based testing. One (open source/free) tool should be selected and tested. Possibly, in the second semester (March-April 2020), an experiment can be set up with students from the course Architecture and Modelling of MIS to gauge their perception of such a tool in terms of ease of use, utility and effectiveness in supporting their testing processes.

Reading list Stanislava Nedyalkova and Jorge Bernardino. 2013. Open source capture and replay tools comparison. In Proceedings of the International C* Conference on Computer Science and Software Engineering (C3S2E'13). Association for Computing Machinery, New York, NY, USA, 117–119. DOI:https://doi.org/10.1145/2494444.2494464

Prerequisites Fluent in installing and trying out technical solutions.

Topic 44: Data-aware process modelling by combining control-flow oriented process modelling with artefact-centric process modelling

Promotor Monique Snoeck

Summary In recent years the importance of data aspects has been acknowledged by the process modelling community, and several approaches for data-aware process modelling have been proposed. A previous master thesis developed a Proof-of-Concept combining the Camunda Process Engine with an application generated by means of the MERODE code generator. It was argued that such a combination makes it possible to satisfy many of the requirements set forth by the (10-year-old) PhilharmonicsFlows [7] framework for data-aware process modelling.

The goal of this thesis is to expand this work conceptually and technically.

Explanation In recent years the importance of data aspects has been acknowledged by the process modelling community, and several approaches have been proposed; see [1] for an overview. Much of this research focuses on the process perspective: on how to make processes data-aware by e.g. developing connections to a database [2], on support for verifying process properties such as safety and liveness [3, 4], or on conformance checking of event logs against an artefact-centric process model [5].

Despite these advances, a global perspective on the relationship between a process model and a data model is still missing (e.g. in terms of an integrated meta-model), as well as a practical approach for modellers on how to balance process modelling and conceptual modelling: what should come first, how the models relate to each other, and how modelling decisions in one view affect the other. Process modelling should be "data-aware" in the sense that an existing conceptual domain model is presumed to exist, and processes should take into account the limitations imposed by that model. Possibly the process modelling may require revisiting the data model. Similarly, data modelling should be conscious of the business processes that need to be supported: as constraints set by a conceptual data model will impact processes, the conceptual data modeller should be aware of which processes are hindered or made possible in order to make the right decisions during data modelling.

The goal of this thesis is to investigate the relationship between conceptual data modelling and process modelling and, in order to complement existing research on data-aware processes, to address this mainly from a conceptual data modelling perspective. In particular, the artefact-centric approach could be considered as a data-driven approach, while BPMN offers a process-centric perspective.
A previous master thesis developed a Proof-of-Concept combining the Camunda Process Engine with an application generated by means of the MERODE code generator. The students argued that such a combination makes it possible to satisfy many of the requirements set forth by the (10-year-old) PhilharmonicsFlows [7] framework for data-aware process modelling.

The goal of this thesis is to expand the work performed by the previous students. The following steps may be envisaged:

- A comparison with more recent frameworks (e.g. [1]) may shed a different light on strong and weak points of the proposed approach.

- The proof of concept may be further elaborated both conceptually and technically. Conceptual elaboration may be performed by working out new case studies, e.g. based on examples found in papers, and demonstrating to what extent the Proof of Concept can meet the expectations.

- From a technical point of view, the proof of concept may be further elaborated so as to allow for the actual testing of more complex scenarios.

Reading list [1] S. Steinau, A. Marrella, K. Andrews, F. Leotta, M. Mecella, and M. Reichert, “DALEC: a framework for the systematic evaluation of data-centric approaches to process management software,” Softw. Syst. Model., vol. 18, no. 4, pp. 2679–2716, 2019. [2] D. Calvanese, M. Montali, F. Patrizi, and A. Rivkin, “Modeling and In-Database Management of Relational, Data-Aware Processes,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 11483 LNCS, pp. 328–345, 2019. [3] A. Artale, D. Calvanese, M. Montali, and W. M. P. van der Aalst, “Enriching Data Models with Behavioral Constraints,” Ontol. Makes Sense, vol. 316, no. Dynamics 365, pp. 257–277, 2019. [4] M. Estañol, M. R. Sancho, and E. Teniente, “Ensuring the semantic correctness of a BAUML artifact-centric BPM,” Inf. Softw. Technol., vol. 93, pp. 147–162, 2018.

[5] M. Estañol, J. Munoz-Gama, J. Carmona, and E. Teniente, “Conformance checking in UML artifact-centric business process models,” Softw. Syst. Model., vol. 18, no. 4, pp. 2531–2555, 2019.

[6] R. Hull, “Artifact-Centric Business Process Models: Brief Survey of Research Results and Challenges,” in On the Move to Meaningful Internet Systems: OTM 2008, 2008, pp. 1152–1163. [7] V. Künzle and M. Reichert, “PHILharmonicFlows: Towards a Framework for Objectaware Process Management,” J. Softw. Maint. Evol. Res. Pract., vol. 23, no. 4, pp. 205–244, 2011.

Prerequisites Interest in data-aware business process modelling and enactment. For the technical implementation: good programming skills, willingness to dive into Process Enactment, XML, web service technology, etc.

Topic 45: Investigating the impact of dispensation requests and exchange programs on the efficiency of the ISP approval process

Promotor Monique Snoeck

Summary The goal of the thesis is to investigate ISP log data in combination with log data from the dispensation request application and the student exchange application to investigate the impact of dispensation requests and student exchange on the efficiency of the ISP approval process.

Explanation The ISP approval process at the start of the academic year is particularly resource-consuming. Process optimisations would benefit both students and administrative staff. The goal of the thesis is to investigate ISP log data in combination with log data from the dispensation request application and the student exchange application, to investigate the impact of dispensation requests and student exchange on the efficiency of the ISP approval process. Students will make use of process mining techniques, but other data analysis techniques can also be considered.

The main goal is to offer key insights pointing to possible improvements of these processes and the ISP approval process in particular. Students will validate key findings by interviewing relevant stakeholders.
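Two basic process-mining quantities such an analysis would start from (the directly-follows relation and case throughput time), sketched on an invented ISP-approval event log; the activity names and timestamps are hypothetical:

```python
from collections import Counter

def directly_follows(event_log):
    """Count directly-follows pairs over all cases: the basic relation
    most process-discovery algorithms build on."""
    pairs = Counter()
    for trace in event_log.values():
        activities = [a for a, _ in sorted(trace, key=lambda e: e[1])]
        pairs.update(zip(activities, activities[1:]))
    return pairs

def throughput_times(event_log):
    """Case duration from first to last event timestamp."""
    return {case: max(t for _, t in tr) - min(t for _, t in tr)
            for case, tr in event_log.items()}

# Hypothetical ISP-approval traces: (activity, timestamp in days)
log = {
    "student1": [("submit", 0), ("advisor check", 2), ("approve", 3)],
    "student2": [("submit", 0), ("dispensation request", 1),
                 ("advisor check", 6), ("approve", 9)],
}
print(directly_follows(log)[("submit", "advisor check")])  # 1
print(throughput_times(log)["student2"])                   # 9
```

Comparing throughput times of cases with and without a dispensation request (here 9 vs. 3 days) is exactly the kind of impact question the thesis would study at scale with real logs.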

Reading list - Dumas et al., Fundamentals of Business Process Management. https://link.springer.com/book/10.1007/978-3-642-33143-5

- Wil van der Aalst, Process Mining, https://link.springer.com/book/10.1007/978-3-642-19345-3

Prerequisites Interest in data analytics. Fluency in Python. Knowledge of Dutch is helpful when interviewing stakeholders (administrative staff) to validate findings.

Topic 46: OWN TOPIC in the domains of Requirements Engineering, Model-driven engineering, technology acceptance or technology-supported learning.

Promotor Monique Snoeck

Summary Students can suggest their own topic for a master's thesis. However, there are three conditions, see further.

Explanation Students are allowed to suggest their own topic for a master's thesis. However, there are three conditions:

1. The topic needs to be in/close to my field of research
2. You need to be a team of at least two students, preferably three students
3. The topic needs to be worked out and approved by me before it can be assigned.

In working out the topic, please provide:
a. an initial literature review,
b. a carefully thought-out research question,
c. a concrete and feasible research plan,
d. a risk analysis + mitigation: what could go wrong, and what is the potential plan B.

Reading list /

Prerequisites Highly motivated students, high self-regulation capabilities, curiosity.

Topic 47: A practical (Self-)Assessment tool for Digital Product Management

Promotor Pieter Hens

Summary Goal: develop an assessment tool for Lean (Digital) Product Management. The tool measures the maturity level of the digital product manager (person) and/or the product organisation with regard to Lean Product Management. The end goal is a validated assessment tool that can be used in practice. Why: An assessment tool helps you to grow in the job role you practice (in this case a Digital Product Manager / Product Owner / Business Analyst / Functional Analyst). It helps to identify the gaps in knowledge and proposes next steps in learning and development. Gaps that need to be filled to create better software products.

Explanation Problem: A lot of software applications that are being built today will never be used. That is millions of euros of wasted effort. There have been numerous studies reporting that 80% of features built in software are never used at all, and that most software projects in an organisation do not create any impact at all. They have zero effect.

Idea: Because of the above, the Lean movement is making its way into software development. The idea behind lean software development is to not simply develop any idea that someone has, but to validate your idea with the actual market (end-user) before you continue and commit. You experiment and test the assumptions you have about a certain "software requirement", thereby first increasing your confidence level in a certain idea before you release it to the big audience. This increases your chances of success and of actually creating an organisational impact. For example: skyscanner.com performs hundreds of these small experiments per day, testing tiny new ideas before actually building them.

Consequence: Because of the above, the classical job role of business analyst and functional analyst (the person gathering and elaborating the software requirements) is evolving. Broader tasks need to be taken up: product strategy (why are we doing this?), product discovery (validating assumptions) and actual requirements engineering (elaborating on the details). Based on this evolution, there is a high demand in the (software product) industry for any guide rails / frameworks / tools and techniques to better structure the work as a Lean Product Manager.

Thesis: The first step to better structure your work in this new way of working is knowing where you currently are and which gaps still need to be filled in. An assessment tool and maturity ranking can help you with this. The goal of this thesis is to create such a tool. Example: https://www.whitmaan.com/product-management-self-assessment-tool

How:
• Pre-study: literature study, getting to know Lean Product Management
• Literature and market study: which assessment tools / questionnaires have already been developed? What are the needs of the industry regarding this assessment?
• Assessment tool: creating a proprietary assessment tool, based on the Strategy - Discovery - Scaling stages in Software Product Development.

Keywords: Software development / Requirements engineering / Lean Startup / Agile development

Reading list /

Prerequisites • Interest in software development processes (how software is being developed) • Interest in requirements engineering


Topic 48: Novel Graph Neural Network Architecture for Credit Card Fraud Detection

Promotor Sandra Mitrović

Summary Graphs, or networks, naturally appear in several applications such as bioinformatics, social analysis, and fraud detection. In this thesis, we explore graph neural networks (GNNs) to propose a novel architecture for credit card fraud detection.

Explanation In numerous applications, data contains complex interdependencies that are better represented as graphs. Fraud detection is a challenging application in which fraudsters devise convoluted schemes to carry out malicious activities. However, standard machine learning models cannot fully exploit graph data. We explore graph neural networks (GNNs), which stem from deep learning and can exploit graph structure to achieve state-of-the-art results. Despite the success of GNNs in several applications, few studies have focused on fraud detection. This thesis therefore aims to propose a novel architecture that addresses the specific challenges of fraud detection. Firstly, students are expected to review the literature to understand the merits of recent models. Then, students will propose a novel GNN, which will be compared against existing models in an experimental setup to evaluate its predictive performance.
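To illustrate how a GNN exploits graph structure, here is a minimal NumPy sketch of a single graph-convolution layer in the style of Kipf & Welling's GCN (a didactic simplification, not the architecture this thesis would develop):

```python
import numpy as np

def gcn_layer(adj, feats, weights):
    """One graph-convolution step: each node averages its neighbours'
    (and its own) features, then applies a linear map and a ReLU.
    adj: (n, n) adjacency matrix; feats: (n, d); weights: (d, d_out)."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1, keepdims=True)  # node degrees incl. self-loop
    agg = (a_hat / deg) @ feats             # mean aggregation over neighbours
    return np.maximum(agg @ weights, 0.0)   # linear map + ReLU
```

Stacking such layers lets information flow over multi-hop neighbourhoods, which is exactly what a tabular model of individual transactions cannot see.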

Reading list - Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., & Philip, S. Y. (2020). A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems. - Chami, I., Abu-El-Haija, S., Perozzi, B., Ré, C., & Murphy, K. (2020). Machine learning on graphs: A model and comprehensive taxonomy. arXiv preprint arXiv:2005.03675.

Prerequisites · Experience or willingness to learn Python. · Willingness to learn machine learning in graphs. · This thesis might require the use of a cloud platform (e.g. Google Cloud)


Topic 49: Graph Generation with Deep Learning methods

Promotor Sandra Mitrović

Summary The generation of realistic graphs is a promising tool for studying networks in several applications: for instance, a generative model can discover new molecules based on previously observed chemical structures. In this thesis, we aim to propose a novel technique that learns from a real network in order to generate graphs that reflect similar properties.

Explanation The lack of available open-source datasets is an issue for studying many real-life problems; hence, the need for generating additional, realistic datasets is well recognised. The generation of realistic graphs was addressed well before the rise of deep learning. Traditional generative models rely on structural assumptions that the researcher can modify at will. These methods are thus inherently hand-engineered and might not reflect real graph data. Learning a generative model directly from observed networks remains an open problem. Recent progress in deep generative models has taken a step forward in generating complex data. However, previous efforts have been limited by the size of the output graphs and by learning from a single graph. This thesis aims to propose a novel technique that overcomes some of the limitations of existing models. Firstly, students are expected to review the literature to understand the merits of recent models. Then, the proposed technique will be compared with traditional and deep generative baselines in an extensive experimental setup.
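The hand-engineered baselines mentioned above can be very simple. As an illustration of the kind of traditional generator the proposed technique would be compared against, here is an Erdős-Rényi model fitted to a single observed graph (it matches only the edge density, which is precisely why it "might not reflect real graph data"):

```python
import random

def fit_edge_probability(n_nodes, edges):
    """Estimate the single Erdős-Rényi parameter p (edge probability)
    from an observed undirected graph given as a list of (i, j) pairs."""
    possible = n_nodes * (n_nodes - 1) / 2
    return len(edges) / possible

def sample_er_graph(n_nodes, p, seed=0):
    """Sample a new undirected graph with the same expected density."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)
            if rng.random() < p]
```

A deep generative model such as GraphRNN instead learns the generation process from data, capturing structure (degree distribution, clustering, motifs) that a one-parameter model cannot.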

Reading list - You, J., Ying, R., Ren, X., Hamilton, W., & Leskovec, J. (2018, July). Graphrnn: Generating realistic graphs with deep auto-regressive models. In International Conference on Machine Learning (pp. 5708-5717). PMLR. - You, J., Liu, B., Ying, R., Pande, V., & Leskovec, J. (2018). Graph convolutional policy network for goal-directed molecular graph generation. arXiv preprint arXiv:1806.02473. - Newman, M. (2018). Networks. Oxford university press.

Prerequisites · Experience or willingness to learn Python. · Willingness to learn machine learning in graphs. · This thesis might require the use of a cloud platform (e.g. Google Cloud)


Topic 50: Feature engineering framework for event logs

Promotor Seppe vanden Broucke

Summary Feature engineering is the process of extracting features from raw data using data mining techniques. Recently, a number of frameworks for automated feature engineering of temporal and relational data sets have been developed. In this project students will be asked to expand on these frameworks and create a general feature engineering framework for event logs. The project therefore has two main components. First, students will be required to study frameworks for automated feature engineering (e.g. https://www.featuretools.com/) and investigate their suitability for event logs. Second, students will propose a general feature engineering framework for event logs.
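To make the task concrete, a feature engineering framework for event logs would automate aggregations like the following hand-written sketch (the attribute names and features are illustrative assumptions, not a proposed design):

```python
from datetime import datetime
from collections import defaultdict

def case_features(events):
    """Aggregate a flat event log of dicts (case_id, activity, timestamp)
    into one feature vector per case: event count, number of distinct
    activities, and case duration in seconds."""
    by_case = defaultdict(list)
    for ev in events:
        by_case[ev["case_id"]].append(ev)
    feats = {}
    for case_id, evs in by_case.items():
        times = [ev["timestamp"] for ev in evs]
        feats[case_id] = {
            "n_events": len(evs),
            "n_activities": len({ev["activity"] for ev in evs}),
            "duration_s": (max(times) - min(times)).total_seconds(),
        }
    return feats
```

A framework in the spirit of Featuretools would generate many such case-level aggregations automatically from primitives, rather than coding each one by hand.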

Explanation /

Reading list /

Prerequisites Programming in Python/R


Topic 51: A comparison of personal data management platforms (with Digita)

Promotor Tom Haegemans

Summary Since the Cambridge Analytica scandal and the introduction of the GDPR, people are more aware of the importance of their personal data trail and their privacy. This has led to a plethora of novel personal data management tools (such as Dock.io) and technologies (such as Solid and Blockchain) to manage personal data in a better way. However, because there are so many tools and technologies, it is difficult to get an overview of which are complementary and which are competitors. The aim of this thesis is therefore to create a framework to evaluate such tools and technologies and to give an overview of the most popular ones. This thesis will be in cooperation with Digita. Digita is a start-up that enables organisations to connect to (or set up) a Solid-based personal data (intra) web so they can easily move to a more resilient personal data infrastructure based on open standards.

Explanation /

Reading list https://www.youtube.com/channel/UCA22hu-0VEHt5tCc7jad74g

Prerequisites /


Topic 52: Rank-aware uplift modeling with a multi-task loss

Promotor Wouter Verbeke

Summary Uplift models estimate the effect of a treatment (e.g., a marketing campaign) on an outcome of interest (e.g., customer churn). In this dissertation, students will implement an advanced approach for improving the performance and use of these uplift models, by re-formulating the objective as ranking instances from large to small effect.

Explanation Uplift models aim to accurately rank instances based on their individual treatment effect, or uplift. Nonetheless, classic uplift modeling methods do not take ranking into account during model training. Recently, an attempt was made to directly incorporate ranking information into the optimization process of the uplift model by using learning-to-rank methods. The main downsides of this method are relatively poor generalization performance and the sacrifice of causal effect point estimates. The student will investigate uplift modeling with a joint optimization objective that takes ranking into account without sacrificing point estimate accuracy, using multi-task neural networks. The steps to be taken are as follows: (i) review and understand the relevant intersection of the uplift modeling, learning-to-rank, and multi-task learning literature; (ii) implement a neural network with a multi-task loss (one ranking-based term, one classic loss term); (iii) empirically investigate and evaluate the behaviour of this model. The successful execution and documentation of these steps will lead to an innovative thesis with application possibilities in fields such as marketing and healthcare.
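The idea of the joint objective can be sketched in a few lines. This NumPy toy (not the proposed architecture; the pairwise logistic ranking term and the squared-error term are illustrative choices) shows how a ranking loss and a point-estimate loss combine into one multi-task objective:

```python
import numpy as np

def multitask_uplift_loss(pred_uplift, true_uplift, alpha=0.5):
    """Joint objective: a pairwise ranking term that penalises pairs
    ordered wrongly by predicted uplift, plus a squared-error term that
    keeps the point estimates of the treatment effect accurate."""
    # point-estimate task: mean squared error on the uplift values
    mse = np.mean((pred_uplift - true_uplift) ** 2)
    # ranking task: logistic loss over all pairs where i truly outranks j
    diffs = []
    n = len(pred_uplift)
    for i in range(n):
        for j in range(n):
            if true_uplift[i] > true_uplift[j]:
                diffs.append(pred_uplift[i] - pred_uplift[j])
    rank = np.mean(np.log1p(np.exp(-np.array(diffs)))) if diffs else 0.0
    return alpha * rank + (1 - alpha) * mse
```

In a multi-task neural network, both terms would be differentiated jointly so that neither ranking quality nor point-estimate accuracy is sacrificed.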

Reading list - Devriendt, F., Van Belle, J., Guns, T., & Verbeke, W. (2020). Learning to rank for uplift modeling. IEEE Transactions on Knowledge and Data Engineering. - Chen, W., Liu, T. Y., Lan, Y., Ma, Z. M., & Li, H. (2009). Ranking measures and loss functions in learning to rank. Advances in Neural Information Processing Systems, 22, 315-323. - Ruder, S. (2017). An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.

Prerequisites Preferably, students have a basic knowledge of machine learning and are able to program in Python.


Topic 53: Understanding and correcting selectivity in the sentiment derived from Flemish Tweets

Promotor Wouter Verbeke

Summary This thesis is in collaboration with Statistics Flanders. Statistics Flanders is the network of Flemish government agencies that develop, produce and publish official statistics. We aim to offer key figures and data about Flanders so that everyone has the right information to make well-founded decisions: citizens, organisations, companies and policy makers. We tell stories with numbers about Flanders: about the people who live and work here, our economy and our environment, and about our place in the world. And we aim to make data openly available so that we maximise its use. You can find more information about us at www.statistiekvlaanderen.be. One of our strategic pillars is data innovation. This is why we continuously monitor developments in the data landscape and see which innovations can be leveraged to create better or new official statistics. One such data innovation is the availability of organic data sources (also called ‘big data’) and new machine learning techniques to analyse these data. We believe that these new data sources, in combination with advanced analytical techniques, could be leveraged to create new official statistics and improve existing ones. The thesis topic described below contains our own ideas for innovation. By working on this topic you will help create a better understanding of our region, which will help organisations and policy makers make better decisions.

Explanation The availability of public social media data and rapid advances in natural language processing algorithms to automatically interpret text have made it possible to analyse the continuous stream of signals being sent out by people. Analysing these signals, such as Facebook posts, tweets, and Instagram photos, using natural language processing techniques is a low-cost and high-frequency method of assessing the sentiment of the region. However, there are significant selectivity issues associated with the creation of such a statistic, e.g.: • The population present on Twitter differs from the general Flemish population (“the average Flemish Twitter user” cannot be considered the same as “the average Flemish citizen”) • Some people might only use Twitter to tweet about certain topics • Some people might only use Twitter during certain times of the day or week (“the weekend Twitter user” vs “the daily Twitter user”). This would mean we measure sentiment from a different population at different times.


In this thesis you will explore which of these selectivity issues are present in the sentiment extracted from Flemish Tweets as well as evaluate avenues to alleviate these issues. Knowledge of Dutch is a nice-to-have for this topic.

Reading list - Biffignandi, S., Bianchi, A., & Salvatore, C. (2018, June). Can Big Data provide good quality statistics? A case study on sentiment analysis on Twitter data. In Int. Total Surv. Error Workshop ITSEW-2018 DISM-Duke Initiat. Surv. Methodol. - Agarwal, A., Xie, B., Vovsha, I., Rambow, O., & Passonneau, R. J. (2011, June). Sentiment analysis of twitter data. In Proceedings of the workshop on language in social media (LSM 2011) (pp. 30-38). - Kouloumpis, E., Wilson, T., & Moore, J. (2011, July). Twitter sentiment analysis: The good the bad and the omg!. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 5, No. 1).

Prerequisites Python programming, machine learning, text analysis


Topic 54: Causal modeling for pricing

Promotor Wouter Verbeke

Summary In this exploratory thesis research project, students are to investigate how causal machine learning can be used for pricing. Causal machine learning aims to learn causal relations for predicting an outcome depending on decision variables, for instance to predict whether a customer purchases a product depending on the price that is charged or a discount that is given.

Explanation In a first step, students are to perform a literature review on the state-of-the-art in data-driven pricing methods and to learn about causal machine learning. A second step is the elaboration of an approach for adopting causal machine learning for pricing. A third step is to set up an experiment to evaluate the performance of the envisioned approach in comparison with a baseline pricing strategy.
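The baseline pricing strategy in the third step could be as simple as the following sketch: estimate the purchase probability at each candidate price from historical (price, bought) observations and pick the revenue-maximising price. Note the hedge: this conditional-mean estimator is only valid causally if historical prices were assigned randomly, which is exactly the gap causal machine learning methods aim to close.

```python
def best_price(history, candidate_prices):
    """Baseline pricing sketch: estimate P(buy | price) by the empirical
    purchase rate at each historical price, then choose the candidate
    price maximising expected revenue price * P(buy)."""
    def buy_rate(price):
        obs = [bought for p, bought in history if p == price]
        return sum(obs) / len(obs) if obs else 0.0
    return max(candidate_prices, key=lambda p: p * buy_rate(p))
```

A causal approach would replace `buy_rate` with a model of the treatment effect of price on purchase, corrected for confounding in how historical prices were set.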

Reading list - Verbeke, W., Olaya, D., Berrevoets, J., & Maldonado, S. (2020). The foundations of cost-sensitive causal classification. arXiv preprint arXiv:2007.12582. - Olaya, D., Verbeke, W., Van Belle, J., & Guerry, M. A. (2021). To do or not to do: cost-sensitive causal decision-making. arXiv preprint arXiv:2101.01407. - Bertsimas, D., & Perakis, G. (2006). Dynamic pricing: A learning approach. In Mathematical and computational models for congestion charging (pp. 45-79). Springer, Boston, MA.

Prerequisites Basic knowledge of machine learning.


Topic 55: Prescriptive process analytics

Promotor Wouter Verbeke

Summary In process analytics, the task of predicting which outcome event will occur given a limited set of possible outcomes is known as decision point mining. A variety of methods have been proposed and applied for addressing this task, where decision trees are often used to understand associations between variables and the decisions made. Instead of predicting decisions, in this research we aim to implement causal machine learning models, like uplift modeling methods, to understand what can be done (a specific treatment) to reach a desired outcome event.

Explanation The main challenge will be devising a way to apply causal prescriptive methods, such as uplift modeling and individual treatment effect (ITE) estimation, to decision point mining. The first step is understanding both decision mining in a process analytics context and causal machine learning methods such as uplift modeling and ITE estimation. The next step is to build a prediction model for a decision mining problem and interpret the results. Then, the same decision mining problem will be tackled using the causal machine learning techniques, and those results will be interpreted as well. To wrap up, it is important to understand how the two approaches compare: where they are similar and where they differ in their use cases, interpretation, and influence on decision making in processes.
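To show the shift from predictive to prescriptive, here is a deliberately simplified two-model uplift estimate at a single decision point (the trace attributes are illustrative assumptions; a real study would condition on case features and use proper uplift methods):

```python
def decision_point_uplift(traces):
    """Two-model uplift sketch at a process decision point: compare the
    rate of the desired outcome between cases that received a candidate
    treatment activity and cases that did not. The difference estimates
    what taking the treatment branch *does* to the outcome, rather than
    merely predicting which branch will be taken."""
    def outcome_rate(group):
        return sum(t["good_outcome"] for t in group) / len(group) if group else 0.0
    treated = [t for t in traces if t["treated"]]
    control = [t for t in traces if not t["treated"]]
    return outcome_rate(treated) - outcome_rate(control)
```

A decision-tree decision-mining model would instead answer "which branch do similar cases take?", which is the predictive question the causal reformulation replaces.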

Reading list - Devriendt, F., Moldovan, D., & Verbeke, W. (2018). A literature survey and experimental evaluation of the state-of-the-art in uplift modeling: A stepping stone toward the development of prescriptive analytics. Big data, 6(1), 13-41. - Berrevoets, J., Verboven, S., & Verbeke, W. (2019). Optimising individual-treatment-effect using bandits. arXiv preprint arXiv:1910.07265. - Rozinat, A., & van der Aalst, W. M. (2006, September). Decision mining in ProM. In International Conference on Business Process Management (pp. 420-425). Springer, Berlin, Heidelberg.

Prerequisites Basic knowledge of machine learning methods, python programming skills.


Topic 56: Adaptive fraud detection to deal with concept drift

Promotor Wouter Verbeke

Summary Credit card fraud detection is a cat-and-mouse game between fraudsters and banks. To deal with this, banks employ machine learning models. However, when a fraud detection model is deployed, fraudsters quickly adapt their behavior, making the models ineffective. In machine learning, this is referred to as concept drift. Concept drift represents a challenge for fraud detection models as they learn from historical data. In this thesis, the student will design an adaptive fraud detection model that dynamically reacts to concept drift. This model will be tested empirically and compared with static approaches.

Explanation A major challenge for banks is credit card fraud. To detect fraud, machine learning models are used to automatically analyze credit card transactions. The problem is that fraudsters adapt their strategies when these stop being successful, resulting in these models quickly becoming ineffective. In machine learning, this issue is referred to as concept drift. To deal with concept drift, models have to continually adapt and learn. In this thesis, the student will first have to find a way to quantify concept drift. Next, this measure can be implemented in a machine learning algorithm to dynamically adapt the models. This novel method will then be tested empirically on real data.
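A simple way to quantify drift, as a starting point rather than the method the thesis would propose, is to monitor accuracy over a sliding window of recent labelled transactions and flag retraining when it falls below a fraction of the training-time accuracy (window size and tolerance below are illustrative):

```python
from collections import deque

class DriftMonitor:
    """Minimal concept-drift signal: track accuracy over a sliding window
    of recent labelled transactions and flag drift when it falls below a
    fixed fraction of a reference (training-time) accuracy."""

    def __init__(self, reference_acc, window=100, tolerance=0.8):
        self.reference = reference_acc
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # rolling correctness flags

    def update(self, prediction, label):
        """Record one labelled prediction; return True if drift is flagged."""
        self.recent.append(prediction == label)
        acc = sum(self.recent) / len(self.recent)
        return acc < self.tolerance * self.reference
```

More principled detectors (DDM, ADWIN, as surveyed by Gama et al.) replace the fixed threshold with statistical tests, and must additionally cope with the delayed labels typical of fraud investigations.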

Reading list - Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4), 1-37. - Dal Pozzolo, A., Boracchi, G., Caelen, O., Alippi, C., & Bontempi, G. (2015, July). Credit card fraud detection and concept-drift adaptation with delayed supervised information. In 2015 International Joint Conference on Neural Networks (IJCNN) (pp. 1-8). IEEE. - Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2007). Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19, 137.

Prerequisites Knowledge of Python, or a willingness to learn. Familiarity with machine learning techniques is a plus.