Feature engineering

Modified on Fri, 26 Jul at 1:08 PM

Introduction

Feature engineering is the process of crafting new features from existing ones. These new features can be created through domain knowledge, purely algorithmically, or with a mix of both. For example, principal component analysis (PCA) reduces the data by computing components, which can be seen as new features of the data. PCA builds these components so that they explain as much of the variation in the data as possible, hence they are created purely algorithmically. In biological data science, it is useful to combine the algorithmic approach with domain knowledge to get even more insight from the data. For example, a pathway is a series of interactions among molecules in a cell that leads to a certain product or to a change in that cell. From domain knowledge, we know that certain features (enzymes) belong to a certain pathway. By grouping these features into pathway sets, we can extract the overall activity of each pathway from the original features, which means converting your features x observations data matrix (e.g. genes x cells) into a sets x observations data matrix (e.g. pathways x cells). This engineered data can then be used in the analyses available in the UniApp.
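To make the matrix conversion concrete, here is a minimal sketch in Python. It is not how the UniApp or GSVA computes scores: the per-set mean is only a stand-in for a real scoring method, and the gene names, pathway names and the `score_sets` helper are hypothetical.

```python
from statistics import mean

# Hypothetical toy data: features (genes) x observations (cells).
expression = {
    "GENE_A": {"cell1": 2.0, "cell2": 0.0},
    "GENE_B": {"cell1": 4.0, "cell2": 1.0},
    "GENE_C": {"cell1": 0.0, "cell2": 3.0},
}

# Hypothetical sets defined from domain knowledge (e.g. pathways).
pathways = {
    "PATHWAY_1": ["GENE_A", "GENE_B"],
    "PATHWAY_2": ["GENE_C", "GENE_X"],  # GENE_X is not present in the data
}


def score_sets(expression, gene_sets):
    """Return a sets x observations matrix using the per-set mean.

    The mean is a placeholder for a real scoring method such as GSVA.
    """
    observations = sorted({obs for row in expression.values() for obs in row})
    scores = {}
    for set_name, members in gene_sets.items():
        # Only features that were actually measured can contribute.
        present = [gene for gene in members if gene in expression]
        if not present:
            continue  # no member of this set is in the data
        scores[set_name] = {
            obs: mean(expression[gene][obs] for gene in present)
            for obs in observations
        }
    return scores


set_scores = score_sets(expression, pathways)
```

The result has one row per set and one column per observation, so the observation names carry over unchanged from the original matrix.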

The UniApp enables you to perform gene set variation analysis (GSVA), which converts domain knowledge (the sets you define, such as pathways) into an analysis-ready data matrix by assessing the relative enrichment of sets across observations using a non-parametric approach. You can also upload custom engineered data generated with any other method, as long as its observation names correspond to those of the original data.
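Before uploading custom engineered data, it can help to check the observation names yourself. The sketch below is an illustration only: `compare_observation_names` is a hypothetical helper, not part of the UniApp, and it simply reports the names found in one matrix but not in the other.

```python
def compare_observation_names(original_names, engineered_names):
    """Report observation names that appear in only one of the two matrices."""
    original = set(original_names)
    engineered = set(engineered_names)
    return {
        "only_in_original": sorted(original - engineered),
        "only_in_engineered": sorted(engineered - original),
    }


# Hypothetical observation names (e.g. cell barcodes or sample IDs).
report = compare_observation_names(
    ["cell1", "cell2", "cell3"],
    ["cell1", "cell2", "cell_3"],
)
```

An empty list under both keys means the two sets of names match exactly. Here the mismatch between `cell3` and `cell_3` is reported on both sides.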

1 Algorithm settings


1.1 Creating a plot




As a first step of the analysis, a plot must be created by clicking on the create plot icon. This will lead you to a section where the analysis of interest (in this case gene set variation analysis) can be selected.



To keep your analyses organised, give the analysis a name and a description in the respective fields. Then, under "choose algorithm to run your analysis", select gene set variation.


1.2 Selecting data


Next, select the input for the analysis as the track element under "choose track element". In the cell selection tab you can choose the observations to use as input. For more information, see the section on Cell/sample selection. Note that subsetting at the pretreatment step is a "hard" subset, meaning that cells/samples excluded at this step will not be present in the downstream steps.


1.3 Setting parameters


In the set parameters field you will be able to define how to perform the gene set variation analysis. 


 

In the Algorithm box you can define how to perform GSVA:

  • Gene set: choose the gene set.
  • Parameters: set the parameters for GSVA.
  • Advanced: subsample cells.


1.3.1 Gene set


GSVA converts the values of the original input features (genes) into scores for the gene sets selected in the Gene set tab. Currently the UniApp supports gene sets hosted on MSigDB and KEGG, which are annotated by domain experts. Additionally, you can provide your own custom gene set.

  

The options in the Gene set tab are:

  • Gene set: select the gene set or sets to use in the gene set variation analysis. Currently you can choose between sets from MSigDB and KEGG, or upload a custom set.
  • Gene subset: select a subset of the previously selected set.


1.3.2 Parameters


For the gene set variation analysis, you can define the following parameters:

  • Minimum set size: if the number of elements of a set found in the data is lower than this value, the set is excluded from the analysis. This can happen because not all features of a set are necessarily present in the data.
  • Kernel: which distribution to use to estimate the cumulative distribution function (CDF) for every selected feature and observation in the data. The choice is between Gaussian, Poisson and None. The default, Gaussian, is suitable when the input values are continuous (e.g. log transformed, which is usually the case in the UniApp). Poisson is suitable when the input values are integer counts (mostly not applicable in the UniApp). None estimates the CDF directly from the data itself, which can be useful when the number of observations is high, since it is much faster.

Most of the time you should use the Gaussian kernel, unless you skipped normalization in the Pretreatment step. If normalization was performed, the data is continuous, hence Gaussian should be used.
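The two parameters above can be illustrated with a small sketch in Python. It is a simplified illustration under stated assumptions, not the UniApp's implementation: the helper names are hypothetical, and the default bandwidth of a quarter of the standard deviation is an assumption made for this example. `gaussian_kernel_cdf` corresponds to the Gaussian option and `empirical_cdf` to the None option, both for a single feature across observations.

```python
import math
from statistics import stdev


def filter_sets_by_size(gene_sets, measured_features, min_size):
    """Keep only sets with at least `min_size` members present in the data."""
    measured = set(measured_features)
    kept = {}
    for name, members in gene_sets.items():
        present = [gene for gene in members if gene in measured]
        if len(present) >= min_size:
            kept[name] = present
    return kept


def gaussian_kernel_cdf(values, bandwidth=None):
    """Smoothed CDF of one feature, evaluated at each of its observations."""
    if bandwidth is None:
        # Assumed bandwidth for this sketch: a quarter of the standard deviation.
        bandwidth = stdev(values) / 4
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive (values must not all be equal)")

    def normal_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    n = len(values)
    return [
        sum(normal_cdf((x - other) / bandwidth) for other in values) / n
        for x in values
    ]


def empirical_cdf(values):
    """CDF estimated directly from the data: fraction of values <= each value."""
    n = len(values)
    return [sum(other <= x for other in values) / n for x in values]


# Hypothetical log-transformed values of one gene across three cells.
gene_values = [1.0, 2.0, 3.0]
smoothed = gaussian_kernel_cdf(gene_values)
direct = empirical_cdf(gene_values)
```

The kernel version evaluates a normal CDF for every pair of observations, while the empirical version only compares values, which is in line with the None option being faster.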


2 Performing GSVA


Once all the parameters are set, click the Run button to compute the gene set variation analysis. This can take quite some time, depending on the feature engineering method and the size of the data (set variation analysis can take several hours on extremely large datasets).

As soon as the analysis is computed, the information table and the preview table will be updated. At this point, the newly computed engineered data is ready to be used in the downstream analyses.

If GSVA was performed on a subset of the data (which is decided in the Design of experiment step), then all the observations that were not used during the computation are still added to the feature engineered data, with all their values set to 0. This is what the Missing observations entry in the information table shows. If you have missing observations, be careful in the downstream analyses: do not select the missing observations, since their values are all 0.
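If you work with an exported copy of the engineered matrix, the zero-filled observations can be separated from the rest before downstream analysis. The sketch below assumes the matrix is held as a nested dictionary (set name, then observation name). `split_missing_observations` is a hypothetical helper. An observation that genuinely scores 0 for every set would also be flagged, so the Missing observations entry in the information table remains the reference.

```python
def split_missing_observations(engineered):
    """Split observations into usable ones and ones that are 0 for every set."""
    observations = sorted({obs for row in engineered.values() for obs in row})
    missing = [
        obs
        for obs in observations
        if all(row[obs] == 0 for row in engineered.values())
    ]
    usable = [obs for obs in observations if obs not in missing]
    return usable, missing


# Hypothetical engineered matrix: cell3 was not part of the computed subset.
engineered = {
    "PATHWAY_1": {"cell1": 0.4, "cell2": -0.2, "cell3": 0},
    "PATHWAY_2": {"cell1": 0.0, "cell2": 0.3, "cell3": 0},
}
usable, missing = split_missing_observations(engineered)
```

Only the observations in `usable` would then be selected for downstream analyses.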


3 Using engineered data matrix as input

Once the engineered data matrix is created, you can use it in almost any other UniApp analysis module that accepts a normal matrix. The engineered matrix appears as a new track element that can be selected as input when creating new algorithms.


3.1 Dimension reduction 

You can color code your dimension reduction plot by gene set score by selecting an engineered data matrix as input for the Data type to plot:

The dimension reduction plot will then be color coded by the gene set score specified in Select engineered feature:


3.2 Marker set analysis

You can effectively turn the Marker gene module into a Marker gene set module by selecting an engineered data matrix as input:


In the resulting table you can see gene sets upregulated in certain observations:




4 Video tutorial

Expected soon
