February 3, 2023

Proposal ideas



Proposal 1

Introduction

Data collection and storage have become commonplace in today’s world, and the medical field is no exception. However, manually processing large amounts of data and finding associations within them is a time-consuming and tedious task, and identifying connections across multiple large data sets is harder still. With advanced technologies such as Artificial Intelligence, this task can be made much easier and more efficient.

Microbiota is a collective term for all the microorganisms that inhabit a particular environment. The larger eukaryotic organism in which these microorganisms live is called the host; for example, the viruses and bacteria living in the human body make up the human microbiota. The human body is estimated to harbour approximately 500 to 1000 different species of bacteria, and each bacterial strain carries a genome comprising thousands of genes.

Clinical data are collected from patients, either through designed experiments or from doctors’ appointments. Clinical data must correspond to the experiment; in other words, the data have to represent the real world from an objective point of view. This distinguishes clinical data from general data: in any analysis of clinical data it is important to define rules for disease, treatment and prognosis, so providing high-quality data is a priority. There are, however, a few issues when working with clinical data. One is the mismatch between goals, more specifically whether the collected data are sufficient for the scientific question being asked. Another is bias introduced by the researchers.

RNA (ribonucleic acid) is a building block of almost every living organism, as well as of viruses. RNA molecules vary in length and form (structure). RNA is central to RNA viruses, which use RNA rather than DNA as their genetic material to infect the host and can thereby cause disease in humans. Transcription from DNA to RNA is a step in the process of synthesizing protein, and this process differs between eukaryotes and prokaryotes. RNA molecules can also regulate gene expression, and can therefore act as agents in human disease.

Artificial Intelligence (AI) has been applied in many areas and has played a large role in associating microbiota with disease. However, these results are still complicated to interpret. Rapidly developing AI techniques, i.e. Machine Learning (ML) and Deep Learning (DL), can process and make predictions from large data sets.

A philosophical premise is used as a guideline: work with clinical data has one intention, namely that the use of the data should provide care, and that purpose needs to be fulfilled. In other words, the collected data should be seen as a public good whose purpose is to help future patients.

Data

In this study, we will utilize a combination of microbiota, RNA-sequencing, and clinical data obtained from Sahlgrenska University Hospital. It should be noted that these data are not publicly available, as they contain sensitive medical information about patients. To the best of our knowledge, no comparable datasets are readily available. Despite this, there has been growing interest in using Artificial Intelligence and machine learning techniques to predict disease patterns and monitor outbreaks in the medical field. This project aims to contribute to these efforts by exploring potential associations between the data sets.

Problem

The problem this study is designed to address is mapping pathogenesis. The project is formulated as follows:

  • Exploring the Pathogenesis of Irritable Bowel Syndrome: Utilizing a Data-Driven Approach to Predict Disease Progression by Analysing the Association between Microbiota, Clinical, and RNA Data

Context

It is difficult to interpret the associations between different data sets. By applying different types of ML algorithms, it becomes easier to locate correlations and associations between them.

Health records today are stored in digital form, more specifically as Electronic Health Records (EHR). An EHR contains clinical reports and prescriptions, to mention a few. This sheer amount of data can be compiled into a data set and analyzed with different machine learning techniques, opening the possibility of discovering new patterns that improve decisions for patients and thereby health care. However, the organization working with the data has some requirements, more precisely to assemble the existing tools and infrastructure needed to make effective use of big data. Furthermore, machine learning is a subset of Artificial Intelligence: machine learning refers to the actual algorithms designed to support decision making, while AI attempts to replicate human intellect. AI is the application that guides the machine to learn from the given data by observing patterns automatically.


Since there is such a vast amount of data, conventional technologies cannot manage it. Three Vs are used to describe the dimensions of big data: volume, velocity and variety. Volume represents the “big” in the term big data, velocity is the processing speed of the data, and variety covers the different forms in which the raw data are collected. These three Vs have become the standard definition of big data. However, a fourth V is emerging: veracity, which focuses on the accuracy and reliability of the collected data.

AI has gained considerable popularity within healthcare, especially during the COVID-19 pandemic, when machine learning made it possible to predict high-risk areas for further outbreaks. Combining healthcare data with AI has improved many areas of healthcare, e.g. real-time monitoring of infections. Furthermore, AI in healthcare has opened new possibilities for improving patient outcomes, for instance through AI-based techniques in diagnostics and other clinical areas. This has led to the development of new and innovative methods with the potential to change how we approach healthcare. The goal of this study is to explore the use of AI in healthcare and evaluate its effectiveness in improving clinical outcomes.

Unsupervised learning is a powerful machine learning technique for analyzing and classifying unlabelled data without human interference. Its aim is to uncover hidden patterns or relationships within the data itself, using various machine learning algorithms. The approach is useful for a range of applications such as exploratory data analysis, customer segmentation, and image and pattern recognition. Common algorithms in unsupervised learning include neural networks, k-means clustering and probabilistic clustering; related methods are also used to reduce the number of features in a model. Unsupervised learning can be a powerful tool for extracting insights from large and complex datasets across a variety of industries and applications.
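As a hedged illustration of one algorithm named above, the following minimal k-means sketch (plain Python on hypothetical two-dimensional points, not a production implementation) alternates between assigning points to the nearest centroid and recomputing centroids:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its closest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: recompute each centroid as its cluster's mean.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids, clusters
```

On well-separated data, e.g. points scattered around (0, 0) and (10, 10), the two recovered centroids land near those centres without any labels being provided.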

Goals and Challenges

The goal is to explore the complexities of Irritable Bowel Syndrome: this study aims to utilize machine learning algorithms to identify correlations among the data sets and uncover the multifactorial pathogenesis of the disease. Analyzing large amounts of data can be challenging, especially when the data do not come with pre-defined labels. In such cases, unsupervised learning can prove to be an effective approach, as it allows the identification of patterns that might be missed or overlooked with traditional supervised learning techniques. In this study, we aim to utilize unsupervised learning to uncover hidden patterns in our data set, which could lead to new insights into the underlying phenomena.

The current study is faced with both theoretical and technical challenges that may impact its progress and outcome. Theoretical challenges stem from the nature of the data being analyzed, including the types of data (microbiota, clinical and RNA), their quality, and the difficulties in obtaining access to clinical data. On the other hand, technical challenges relate to computational resources, including the availability and efficiency of computing power for running the data analysis, and the possibility of needing to utilize a cluster for larger data sets.

Approach

Methodology

In this study, we aim to develop a machine learning model that will be able to predict correlations between microbiota, clinical, and RNA data. To begin, the data will undergo a cleaning process, which is estimated to take one to two weeks. During this phase, the data will be tested using a simple linear regression approach to ensure its suitability for longer training runs. After the initial testing stage, descriptive statistics will be performed on the data. Next, a subset of the cleaned data will be created, and machine learning models will be developed. The subset of the data will be utilized to train the models and hyper-parameter tuning will be conducted to optimize the model. Upon completion of the training phase, the results will be analyzed to assess the accuracy of the model’s predictions and determine if further training is required. Additionally, statistical analysis will be performed to supplement our findings.
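The simple linear-regression sanity check described above can be sketched as follows (plain Python least squares for a single predictor; the column names and values are hypothetical placeholders, not the actual study data):

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = a + b*x on one feature column."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance(x, y) divided by variance(x).
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical cleaned columns, e.g. a microbial abundance vs. a clinical score.
abundance = [1.0, 2.0, 3.0, 4.0]
score = [2.1, 3.9, 6.1, 8.0]
intercept, slope = fit_simple_linear_regression(abundance, score)
```

A fit like this is cheap to run after cleaning, so it can confirm the data load and basic numeric sanity before committing to longer training runs.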

Evaluation

In this study, we aim to thoroughly evaluate the predictive capability of our machine learning model by utilizing a portion of our data for validation purposes. This reserved data will serve as an independent assessment tool and will not be involved in the training process. By conducting a correlation analysis between various data points, we aim to confirm the validity and reliability of the model’s predictions. This evaluation process will allow us to gain a deeper understanding of the underlying patterns and relationships within the data and enable us to detect any potential discrepancies or inaccuracies in our model’s output. The outcomes of this validation process will significantly contribute to the refinement and improvement of our model and provide valuable insights into the complex relationships between microbiota, clinical, and RNA data.
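The held-out evaluation described above can be sketched as follows (plain Python; the 80/20 split ratio and Pearson correlation as the validity measure are assumptions for illustration, not fixed choices of the study):

```python
def pearson(xs, ys):
    """Pearson correlation, e.g. between predictions and held-out observations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def holdout_split(rows, frac=0.8):
    """Reserve the tail of the data for validation; it never enters training."""
    cut = int(len(rows) * frac)
    return rows[:cut], rows[cut:]
```

A model fitted on the training portion would then be scored by correlating its predictions with the observed values in the reserved portion, which never influenced training.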



Proposal 2

Introduction

TVM (Tensor Virtual Machine) provides optimizations at multiple levels. When a model is imported, the first action at the first level is optimization of the computational graph; through this optimization the graph gains operator fusion, layout transformation and memory management, while later optimizations are applied at the tensor layer. TVM has many layers. The layer where users interact is the user interface (UI) layer, which is written in Python and supports models from multiple frameworks, i.e. TensorFlow and PyTorch; these models are converted into compatible TVM graphs. During the computation-graph optimization stage, the graph is put through multiple optimizations, e.g. pre-computation, which makes constant nodes within the graph run at compile time. A new data layout may require conversion operations to be added for the layers within the graph. Afterwards, fusion is performed to connect several operators into a single kernel, so that no intermediates are stored. A new cost-based model performs automated optimization of the low-level program against the hardware properties, yielding fast optimization of the code. The next layer performs optimizations for the specific hardware and runs space scheduling. The following optimization stage is applied at the tensor layer: tensor expressions form an expression language constructed to generate code automatically. At this stage the graph is tensorized, an important step for accelerators, and this layer also provides memory scopes and memory management. TVM supports a vast range of hardware optimizations and has been implemented on embedded CPUs, GPUs and FPGAs, to mention a few, producing state-of-the-art results for the specific hardware.
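The operator fusion mentioned above can be illustrated with a toy sketch (plain Python, not TVM’s actual generated kernels): two elementwise operators are combined into a single loop so the intermediate result is never materialized:

```python
def add(a, b):
    """Elementwise addition: one 'operator'."""
    return [x + y for x, y in zip(a, b)]

def relu(xs):
    """Elementwise ReLU: a second 'operator'."""
    return [max(0.0, x) for x in xs]

# Unfused: add() materializes an intermediate list before relu() runs.
def unfused_add_relu(a, b):
    return relu(add(a, b))

# Fused: one pass over the data, no stored intermediate.
def fused_add_relu(a, b):
    return [max(0.0, x + y) for x, y in zip(a, b)]
```

Both versions compute the same result; the fused form simply avoids the memory traffic of writing and re-reading the intermediate, which is the payoff of kernel fusion.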

Profiles of Convolutional Neural Networks (CNNs) have already been produced by executing them with the TVM compiler, with successful results. These CNNs have been running on hardware called Field-Programmable Gate Arrays (FPGAs). FPGAs have much lower power consumption and higher performance per watt consumed. The reason is that an FPGA is reconfigurable hardware, i.e. the connections within the FPGA can be programmed to the user’s desire.

A recent extension of neural networks is the Graph Neural Network (GNN). GNNs are constructed from nodes and edges. There are two different graph scenarios: structural and non-structural. The structural scenario is used for explicit applications, e.g. knowledge graphs. The non-structural scenario is the opposite, implicit: the graph is first constructed before a task is executed, e.g. by connecting the words of a text. A GNN inherits the same properties as an ordinary graph from graph theory: directed/undirected, homogeneous/heterogeneous, dynamic/static.
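As a hedged, minimal sketch of what a GNN layer computes over nodes and edges (plain Python with scalar node features and mean aggregation; real GNN layers use learned weight matrices and nonlinearities):

```python
def message_passing_step(features, edges, self_weight=0.5, neigh_weight=0.5):
    """One round of message passing: each node mixes its own feature
    with the mean feature of its neighbours."""
    # Build an undirected adjacency list from the edge set.
    neighbours = {node: [] for node in features}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    updated = {}
    for node, feat in features.items():
        ns = neighbours[node]
        agg = sum(features[n] for n in ns) / len(ns) if ns else 0.0
        updated[node] = self_weight * feat + neigh_weight * agg
    return updated
```

Stacking several such rounds lets information propagate further across the graph, which is the core mechanism behind a GNN’s ability to generalize over graph structure.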

Problem

The problem this study is designed to solve is formulated as follows:

  • Profile the Graph Neural Network for the TVM compiler.

Context

Executing CNNs on FPGAs has been done with great success, both in speed and in power consumption. CNNs have been applied to solve a vast range of complex tasks, e.g. image classification and object detection. However, they are not well suited for modeling complex real-world networks and behaviors in the way GNNs are. GNNs are flexible, and because of this flexibility they can be applied in a wide range of areas, e.g. online optimization. The biggest advantage of GNNs is their capability to generalize when applied.

There are publications on running CNNs with the TVM compiler. However, to my knowledge there has been no publication on profiling Graph Neural Networks with the TVM compiler. GNNs have, however, been used to model the performance of Deep Neural Networks (DNNs): a machine learning model was implemented with Halide and TVM to search for deep learning algorithms with valid implementations.

Goals and Challenges

The goal of this thesis is to create a profile of the Graph Neural Network (GNN) running on the TVM compiler. The code base of the profile will be stored in an open-source repository, free for anyone to use. The main programming language will be Python; however, there is a possibility that C++ might be used as well.

There are several challenges in this study, both theoretical and practical. Theoretically, there is a need to conduct research related to GNNs. The practical part consists of the actual profiling of the GNN model; working with large machine-learning frameworks and locating the GNN code base within them is also a challenge. Another challenge is that no FPGA is available to test the code base on.

Approach

Methodology

I will produce a profiling code base for the GNN on the TVM compiler. This code base will be written in Python for ease of use. The profile will also be tested on a Raspberry Pi as a demonstration.

Evaluation

To evaluate the profile, it must be executed under different conditions: measuring performance of a GNN on an ordinary platform, e.g. a laptop, as well as on a Raspberry Pi and, hopefully, on an FPGA for comparison. All tests will use the same data set, since otherwise it is impossible to establish a baseline. The metrics of interest are memory usage, computations per minute, accuracy and, possibly, power usage for calculating efficiency.
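A minimal sketch of the kind of measurement harness this evaluation assumes (Python standard library only; the workload function here is a placeholder standing in for a GNN inference call, and the metric names are illustrative):

```python
import time
import tracemalloc

def profile(workload, *args):
    """Run a workload once, recording wall-clock time and peak memory."""
    tracemalloc.start()
    start = time.perf_counter()
    result = workload(*args)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"seconds": elapsed, "peak_bytes": peak_bytes, "result": result}

# Placeholder workload standing in for one GNN inference run.
stats = profile(sum, range(100_000))
```

The same harness run unchanged on a laptop and on a Raspberry Pi would give directly comparable time and memory figures for one data set; power usage would need an external meter.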

  • [8] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57–81, 2020.