#] 
#] *********************
#] "$d_Refs"'Mathematics/8_[NN, genetics] [nomenclature, math].txt'
# www.BillHowell.ca  10Feb2024 initial 
# view in text editor, using constant-width font (eg courier), tabWidth = 3

#48************************************************48


#24************************24
# Table of Contents, generate with :
# $ grep  "^#]"  "$d_Refs"'Mathematics/8_[NN, genetics] [nomenclature, math].txt' |  sed "s/^#\]/  /" 
#
   *********************
   "$d_Refs"'Mathematics/8_[NN, genetics] [nomenclature, math].txt'
   ??Feb2024 
   ??Feb2024 
   ??Feb2024 
   +-----+
   10Feb2024 IJCNN2024 review paper stuff 
   Nomenclature 
   Gene examples (oldies but goodies)
   global independent feature extractor
      KmerEmbedding:
      Deep Neural Nets (DNN)
   k-mer encoding
   one-hot extraction


#24************************24
# Setup, ToDos,   


#08********08
#] ??Feb2024 


#08********08
#] ??Feb2024 


#08********08
#] ??Feb2024 


#08********08
#] +-----+
#] 10Feb2024 IJCNN2024 review paper stuff 
"$d_web"'Neural nets/Paper reviews/<yymmdd> [journal, conference] paper review- math only.txt'

#] Nomenclature 

lncRNAs		long non-coding RNAs : endogenous single-stranded polynucleotides 
				with a sequence length >=200 nucleotides that do not encode proteins
ncRNA			non-coding RNA

LPIGLAM		LPI prediction based on [global, local] features of lncRNA and protein
LPI			lncRNA-protein interactions


#] Gene examples (oldies but goodies)

p53 & H19 	interplay has major roles in tumorigenesis and metastasis?
	H19		tumorigenesis, but also crucial to embryonic development
				one of the first discovered lncRNAs 
	p53		tumor suppressor, represses the H19 gene
	are mutually counter-regulated :
	|->	p53 represses the H19 gene
	|->	H19-derived miR-675 inhibits p53 and p53-dependent protein expression

HOTAIR		HOX Transcript Antisense Intergenic RNA
	|->	PRC2		H3K27-methylation
	|->	LSD1		H3K4-demethylation [7]


#] global independent feature extractor

#]    KmerEmbedding:

We use k-mer features to encode lncRNA and protein to capture the global characteristics of sequences. The k-mer features transform variable-length sequences into fixed-length feature vectors. 

For lncRNA sequences
	we calculate the corresponding nucleotide frequencies (A, U, G, C) to fully extract features. 
	Then, we take combinations of k = 1, 2, 3, and 4 to obtain a 340-dimensional feature vector.
For proteins, based on dipole moment and side chain volume, 
	we divide the 20 amino acids into 7 groups: {Ala, Gly, Val}, {Ile, Leu, Phe, Pro}, {Thr, Met, Tyr, Ser}, {His, Asn, Trp, Gln}, {Arg, Lys}, {Glu, Asp}, {Cys} [27]. 
	We take combinations of k = 1, 2, and 3 to calculate the frequency of protein sequences, resulting in a 399-dimensional feature vector.
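
A minimal Python sketch of the k-mer frequency encoding described above, to check the stated dimensions (4 + 16 + 64 + 256 = 340 for lncRNA, 7 + 49 + 343 = 399 for proteins). Assumptions, not from the paper: the function names, the example sequences, normalizing counts separately for each k, and skipping any window that contains a symbol outside the alphabet.

```python
from itertools import product

RNA_ALPHABET = "AUGC"

# Seven amino-acid groups from the notes above, as one-letter codes.
# Group labels "0".."6" are an arbitrary choice for this sketch.
GROUPS = ["AGV", "ILFP", "TMYS", "HNWQ", "RK", "ED", "C"]
AA_TO_GROUP = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}
GROUP_ALPHABET = "0123456"


def kmer_frequencies(seq, alphabet, ks):
    """Concatenate normalized k-mer counts for every k in ks (fixed length)."""
    features = []
    for k in ks:
        counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
        for i in range(max(len(seq) - k + 1, 0)):
            kmer = seq[i:i + k]
            if kmer in counts:  # assumption: ignore windows with unknown symbols
                counts[kmer] += 1
        total = sum(counts.values())
        features.extend(c / total if total else 0.0 for c in counts.values())
    return features


def reduce_protein(seq):
    """Map each amino acid to its group label, dropping unknown residues."""
    return "".join(AA_TO_GROUP[aa] for aa in seq if aa in AA_TO_GROUP)


# Hypothetical toy sequences, for illustration only.
lnc_vec = kmer_frequencies("AUGGCUAAGC", RNA_ALPHABET, (1, 2, 3, 4))
prot_vec = kmer_frequencies(reduce_protein("MKVLAAGIC"), GROUP_ALPHABET, (1, 2, 3))
print(len(lnc_vec), len(prot_vec))  # 340 399
```

The vector length depends only on the alphabet size and the choice of k values, not on the sequence length, which is what makes variable-length sequences comparable.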

#]    Deep Neural Nets (DNN)
As k-mer features already contain higher-level information, we utilize a simple deep neural network to extract features, employing LeakyReLU to prevent the vanishing gradient problem and dropout to address overfitting issues.

(11)	Lglobal = DNN_1(Lkmer) 
(12)	Pglobal = DNN_2(Pkmer) 
where 
	DNN_1 and DNN_2 are deep neural networks constructed by stacking multiple layers in the arrangement of dropout layer, fully connected layer, and LeakyReLU activation function.
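
A rough pure-Python sketch of the forward pass of equations (11) and (12), stacking [dropout, fully connected, LeakyReLU] blocks in that order. Everything numeric here is an assumption for illustration: the hidden sizes (64, 32), the dropout rate (0.2), the LeakyReLU slope (0.01), the random weight initialization, and the dummy inputs. The excerpt above gives none of these, and no training is shown.

```python
import random


def leaky_relu(x, slope=0.01):
    """LeakyReLU: pass positives through, scale negatives by a small slope."""
    return x if x > 0 else slope * x


def dropout(vec, rate, rng, training):
    """Inverted dropout: zero entries with probability `rate` while training."""
    if not training or rate == 0.0:
        return list(vec)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def make_dnn(sizes, rng):
    """Build the layer list for consecutive sizes, e.g. [340, 64, 32]."""
    return [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def dnn_forward(x, layers, rate, rng, training=False):
    """Apply dropout -> fully connected -> LeakyReLU for each layer."""
    for weights, biases in layers:
        x = dropout(x, rate, rng, training)
        x = [leaky_relu(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x


rng = random.Random(0)
dnn_1 = make_dnn([340, 64, 32], rng)  # hypothetical sizes for the lncRNA branch
dnn_2 = make_dnn([399, 64, 32], rng)  # hypothetical sizes for the protein branch

l_kmer = [1.0 / 340] * 340            # dummy stand-in for Lkmer
p_kmer = [1.0 / 399] * 399            # dummy stand-in for Pkmer
l_global = dnn_forward(l_kmer, dnn_1, rate=0.2, rng=rng)
p_global = dnn_forward(p_kmer, dnn_2, rate=0.2, rng=rng)
print(len(l_global), len(p_global))  # 32 32
```

With training=False dropout is the identity, so repeated calls give the same output. With training=True the surviving entries are rescaled by 1/(1 - rate) so the expected activation is unchanged.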


#08********08
#] k-mer encoding
10Feb2024 

https://en.wikipedia.org/wiki/K-mer
In bioinformatics, k-mers are substrings of length k contained within a biological sequence. Primarily used within the context of computational genomics and sequence analysis, in which k-mers are composed of nucleotides (i.e. A, T, G, and C), k-mers are capitalized upon to assemble DNA sequences,[1] improve heterologous gene expression,[2][3] identify species in metagenomic samples,[4] and create attenuated vaccines.[5] Usually, the term k-mer refers to all of a sequence's subsequences of length k, such that the sequence AGAT would have four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT). More generally, a sequence of length L will have L − k + 1 k-mers and n^k total possible k-mers, where n is the number of possible monomers (e.g. four in the case of DNA). 
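
A short Python sketch reproducing the AGAT example and the two counting rules from the excerpt (L − k + 1 k-mers in a sequence, n^k possible k-mers). The function names are chosen here for illustration only.

```python
def kmers(seq, k):
    """Return all overlapping substrings of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def possible_kmers(n, k):
    """Number of distinct k-mers over an alphabet of n monomers."""
    return n ** k


seq = "AGAT"
for k in range(1, len(seq) + 1):
    found = kmers(seq, k)
    # The number of k-mers found always equals L - k + 1.
    print(k, found, len(found) == len(seq) - k + 1)

print(possible_kmers(4, 3))  # 64 possible 3-mers for DNA
```

For k larger than the sequence length the list is empty, since no window of that size fits.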


#08********08
#] one-hot extraction
10Feb2024   

https://en.wikipedia.org/wiki/One-hot
In digital circuits and machine learning, a one-hot is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0).[1] A similar implementation in which all bits are '1' except one '0' is sometimes called one-cold.[2] In statistics, dummy variables represent a similar technique for representing categorical data.
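
A small Python sketch of one-hot (and one-cold) encoding, applied to nucleotides since that is the use in these notes. The alphabet order "AUGC" and the function names are assumptions for illustration.

```python
def one_hot(symbol, alphabet):
    """Vector with a single 1 at the position of `symbol` in `alphabet`."""
    if symbol not in alphabet:
        raise ValueError(f"unknown symbol: {symbol!r}")
    return [1 if s == symbol else 0 for s in alphabet]


def one_cold(symbol, alphabet):
    """Complement of one-hot: a single 0, all other bits 1."""
    return [1 - bit for bit in one_hot(symbol, alphabet)]


alphabet = "AUGC"
encoded = [one_hot(base, alphabet) for base in "GAU"]
print(encoded)                 # [[0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0]]
print(one_cold("A", alphabet)) # [0, 1, 1, 1]
```

Unlike k-mer frequencies, this encoding keeps position information, so its size grows with the sequence length (L rows of n bits each).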


# enddoc