Sound Encryption Using Feature Extraction and Neural Network

In recent years, interest in secure communication techniques has grown enormously and unexpectedly. To carry out the transfer process and to guarantee that the data (sound) reaches its intended destinations while travelling over a shared network that is within everyone's reach, the transmitted information needs to be encrypted. The research was carried out in three stages. The first stage extracted the features of the sound file to be sent. In the second stage, neural networks were used to encrypt the features produced by the first stage. In the final stage, one of the encryption algorithms was used to encrypt the results of the previous stage. Male and female speech signals were coded and encrypted, and the measures (SNR, PSNR, NRMSE) were used to verify the correctness and efficiency of the results. In addition, Matlab was used as the programming language in this research.

Received: 1/10/2011        Accepted: 21/12/2011


1. Introduction
In this research, we propose a new approach for encrypting and compressing sound signals using:
• Linear predictive coding (LPC).
• Elman neural network (ENN).
• XOR encoding.
LPC is one of the most powerful speech analysis techniques, and one of the most useful methods for encoding good-quality speech at a low bit rate. It provides extremely accurate estimates of speech parameters and is relatively efficient to compute. It digitally encodes analog signals using a single-level or multilevel sampling system in which the value of the signal at each sample time is predicted as a linear function of the past values of the quantized signal. LPC is related to adaptive predictive coding (APC) in that both use adaptive predictors; however, LPC uses more prediction coefficients to permit a lower information bit rate than APC, and thus requires a more complex processor [16].
A particular type of neural network is the recurrent neural network. This network is a dynamical system in which the output depends on the inputs and on an internal state that evolves with the network inputs.
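To make the prediction idea concrete, the following MATLAB sketch (our own minimal example, not code from the paper; the file name speech.wav and the order p = 10 are assumptions) estimates a 10th-order predictor and measures how small the prediction residual is relative to the signal:

    % Minimal sketch: predict each sample as a linear function of the past p samples.
    [s, fs] = wavread('speech.wav');   % hypothetical input file (2011-era MATLAB API)
    s = s(:, 1);                       % use a single channel
    p = 10;                            % prediction order
    a = lpc(s, p);                     % a = [1 -a1 ... -ap], autocorrelation method
    e = filter(a, 1, s);               % prediction residual (the innovation)
    fprintf('residual/signal energy = %.4f\n', sum(e.^2) / sum(s.^2));

The smaller the residual energy, the better the past samples predict the current one, which is what makes the low bit rates discussed below possible.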

2. Related Work
Neural networks have been used in several studies. [3] uses a modular neural network (MNN) to identify the speaker from extracted characteristics, while [10] uses a neural network for speaker recognition by analyzing the sound signal with the help of intelligent techniques. [6] uses the mean and variance of the discrete wavelet transform, in addition to features used previously for audio classification, with multilayer perceptron (MLP) neural networks as the classifier. The aim of [1] is to develop a system for encoding good-quality speech at a low bit rate using linear predictive coding, and [9] uses linear prediction for better interpretation of spoken words. In contrast, our research combines a neural network with LPC to encrypt sound.

3. Why Use LPC
Under normal circumstances, speech is sampled at 8000 samples/second with 8 bits used to represent each sample, giving a rate of 64000 bits/second. Linear predictive coding reduces this to 2400 bits/second by breaking the speech into segments and then sending, for each segment, the voiced/unvoiced information, the pitch period, and the coefficients of the filter that represents the vocal tract. At this reduced rate the speech has a distinctive synthetic sound and there is a noticeable loss of quality; however, the speech is still audible and can be easily understood. Since there is information loss in linear predictive coding, it is a lossy form of compression. This low bit rate is also appealing to governments because of its resistance to jamming and channel noise [17].

4. LPC Modeling
A. Physical Model:
When you speak:
• Air is pushed from your lungs through your vocal tract and out of your mouth, becoming speech.
• For certain voiced sounds, your vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice. Women and young children tend to have high pitch (fast vibration) while adult males tend to have low pitch (slow vibration).
• For certain fricative and plosive (or unvoiced) sounds, your vocal cords do not vibrate but remain constantly open.

Figure (1): Physical Model

• The shape of your vocal tract determines the sound that you make.
• As you speak, your vocal tract changes its shape, producing different sounds.
• The shape of the vocal tract changes relatively slowly (on the scale of 10 msec to 100 msec).
• The amount of air coming from your lungs determines the loudness of your voice. [1]

B. Mathematical Model:
• The above model is often called the LPC model.
• The model says that the digital speech signal is the output of a digital filter (called the LPC filter) whose input is either a train of impulses or a white noise sequence.
• The relationship between the physical and the mathematical models [1][16]:

Vocal tract ↔ H(z) (LPC filter)
Air ↔ u(n) (innovations)
Vocal cord vibration ↔ V (voiced)
Vocal cord vibration period ↔ T (pitch period)
Fricatives and plosives ↔ UV (unvoiced)
Air volume ↔ G (gain)

The LPC filter is given by:

H(z) = G / (1 - Σ_{k=1}^{10} a_k z^{-k})

Figure (2): Mathematical Model

which is equivalent to saying that the input-output relationship of the filter is given by the linear difference equation [1]:

s(n) = Σ_{k=1}^{10} a_k s(n-k) + G u(n)

The LPC model can be represented in vector form as:

A = [a_1, …, a_10, G, V/UV, T]

• A changes every 20 msec or so. At a sampling rate of 8000 samples/sec, 20 msec is equivalent to 160 samples.
• The digital speech signal is divided into frames of size 20 msec; there are 50 frames/second.
• The model says that {s(n) : 0 ≤ n ≤ 159} is equivalent to A, so the 160 values of s(n) are compactly represented by the 13 values of A [5][13][15].
• There is almost no perceptual difference in s(n) if:
  - for voiced sounds (V), the impulse train is shifted (insensitivity to phase change);
  - for unvoiced sounds (UV), a different white noise sequence is used.
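As a concrete illustration of this model, the following MATLAB sketch generates one frame of synthetic speech from the LPC parameters, under our own illustrative assumptions (a 160-sample frame, pitch period T = 40, gain G = 0.1, and a made-up stable A(z)):

    % Minimal LPC synthesis of one frame: excitation -> 1/A(z) filter.
    N = 160; T = 40; G = 0.1;          % frame length, pitch period, gain (illustrative)
    a = [1, -1.3, 0.6, zeros(1, 8)];   % placeholder [1 -a1 ... -a10], stable
    voiced = true;
    if voiced
        u = zeros(N, 1); u(1:T:N) = 1; % impulse train with period T (voiced)
    else
        u = randn(N, 1);               % white noise sequence (unvoiced)
    end
    s_hat = filter(G, a, u);           % s(n) = G*u(n) + sum_k a_k*s(n-k)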
LPC synthesis: Given A, generate s(n) (this is done using standard filtering techniques). LPC analysis: Given s(n), find the best A (this is described in the next section).

5. LPC Analysis
Consider one frame of speech signal:

{s(n) : 0 ≤ n ≤ 159} ……..….(1)

The signal s(n) is related to the innovation u(n) through the linear difference equation:

s(n) = Σ_{k=1}^{10} a_k s(n-k) + G u(n) ……..….(2)

The ten LPC parameters a_1, …, a_10 are chosen to minimize the energy of the innovation:

E = Σ_{n=0}^{159} [ s(n) - Σ_{k=1}^{10} a_k s(n-k) ]² ……..….(3)

Using standard calculus, we take the derivative of E with respect to a_i and set it to zero:

∂E/∂a_i = 0,  i = 1, …, 10 ……..….(4)

We now have 10 linear equations with 10 unknowns [16]:

Σ_{k=1}^{10} a_k R(i-k) = R(i),  i = 1, …, 10 ……..….(5)

where R(i) = Σ_n s(n) s(n-i) is the autocorrelation of the frame.

5.1 Levinson-Durbin Recursion:
Solve the above equations for a_1, …, a_10. To get the other three parameters (G, V/UV, T), we solve for the innovation:

e(n) = s(n) - Σ_{k=1}^{10} a_k s(n-k) ……..….(6)

with the gain taken from the innovation energy:

G² = (1/160) Σ_{n=0}^{159} e(n)² ……..….(7)

Then calculate the autocorrelation of e(n):

R_e(i) = Σ_n e(n) e(n-i) ……..….(8)

Then make a decision based on the autocorrelation: a prominent peak of R_e at some lag T marks the frame as voiced with pitch period T; otherwise the frame is unvoiced [17].
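The analysis steps of this section can be sketched in MATLAB as follows (our own assumptions: the autocorrelation method via levinson from the Signal Processing Toolbox, a minimum pitch lag of 20 samples, and an illustrative 0.3 peak-ratio threshold for the voiced/unvoiced decision):

    % Per-frame LPC analysis: Levinson-Durbin on the normal equations, then
    % a voiced/unvoiced and pitch decision from the innovation autocorrelation.
    frame = randn(160, 1);                 % stand-in for one 20 ms frame
    p = 10;
    r = xcorr(frame, p, 'biased');         % autocorrelation R(-p..p)
    r = r(p+1:end);                        % keep R(0..p)
    a = levinson(r, p);                    % solves the normal equations, eq. (5)
    e = filter(a, 1, frame);               % innovation e(n), eq. (6)
    G = sqrt(mean(e.^2));                  % gain from the innovation energy, eq. (7)
    Re = xcorr(e, 'biased');
    Re = Re(numel(e):end);                 % R_e(0), R_e(1), ..., eq. (8)
    minlag = 20;                           % assumed minimum pitch lag
    [pk, idx] = max(Re(minlag+1:end));     % strongest peak beyond the minimum lag
    voiced = pk / Re(1) > 0.3;             % assumed voicing threshold
    if voiced, T = minlag + idx - 1; else, T = 0; end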

2.4 kbps LPC Vocoder
The following is a block diagram of a 2.4 kbps LPC vocoder:
• The LPC coefficients are represented as line spectrum pair (LSP) parameters.
• LSP parameters are mathematically equivalent (one-to-one) to the LPC coefficients.
• LSP parameters are more amenable to quantization.
• LSP parameters are calculated as follows [11]:

P(z) = A(z) + z^{-11} A(z^{-1}),  Q(z) = A(z) - z^{-11} A(z^{-1})

where A(z) = 1 - Σ_{k=1}^{10} a_k z^{-k} is the LPC polynomial.
• Factoring the above equations, we get:

P(z) = (1 + z^{-1}) Π_{i=1}^{5} (1 - 2 cos(ω_{2i-1}) z^{-1} + z^{-2})
Q(z) = (1 - z^{-1}) Π_{i=1}^{5} (1 - 2 cos(ω_{2i}) z^{-1} + z^{-2})

The angles ω_1, …, ω_10 are called the LSP parameters.
• LSP parameters are ordered and bounded: 0 < ω_1 < ω_2 < … < ω_10 < π.
• LSP parameters are more correlated from one frame to the next than LPC coefficients.
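The one-to-one LPC/LSP mapping can be checked directly in MATLAB, assuming the Signal Processing Toolbox functions poly2lsf and lsf2poly are available:

    % Round trip between LPC coefficients and line spectral frequencies.
    a  = lpc(randn(1024, 1), 10);      % some 10th-order LPC polynomial
    w  = poly2lsf(a);                  % ordered and bounded: 0 < w(1) < ... < w(10) < pi
    a2 = lsf2poly(w);                  % inverse mapping
    max(abs(a - a2))                   % numerically ~0, confirming the equivalence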
• The frame size is 20 msec. There are 50 frames/sec, so 2400 bps is equivalent to 48 bits/frame. These bits are allocated among the LSP parameters, the pitch period, the voiced/unvoiced flag, and the gain.

Elman Neural Network (ENN)
The recurrent contexts provide a weighted sum of the previous values of the hidden units as input to the hidden units. As shown in Figure (4), the activations are copied from the hidden layer to the context layer on a one-for-one basis, with a fixed weight of 1.0 (w = 1.0). The forward connection weights between the hidden units and the context units are trained together with the other weights. If self-connections are introduced to the context units, and the values of the self-connection weights (a) are fixed between 0.0 and 1.0 (usually 0.5) before the training process, the result is the improved ENN proposed in [2]; when the weights (a) are 0, the network is the original ENN.
From Figure (5) we can see that training such a network is not straightforward, since the output of the network depends on the current inputs and also on all previous inputs to the network; the network must trace the previous values through the recurrent connections (Figure 4). The calculation of the functional derivatives is therefore not straightforward, which leads to low efficiency on various signal problems. Figure (6) shows an unrolled ENN, in which backpropagation is used to calculate the derivatives of the error (at each output unit) by unrolling the network back to the beginning. At the next time step t+1 the input is presented again; the context units contain values which are exactly the hidden-unit values at time t (and times t-1, t-2, …), and these context units provide the network with memory [8]. Therefore, the ENN is a dynamical network that makes efficient use of the temporal information in the input sequence, both for classification and for prediction [7][14]. However, the efficiency of the ENN is limited to low-order systems, in some degree due to the insufficient calculation of the derivatives. The error minimized at time k is

E(k) = (1/2) Σ_j (t_pj(k) - o_pj(k))²

where t_pj(k) is the target value (desired output) of the j-th component of the output for pattern p at time k, o_pj(k) is the j-th unit of the actual output pattern produced by the presentation of input pattern p at time k, and j indexes all the output units [12].

Figure (5): Internal Process Analysis of ENN
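To make the recurrence concrete, the following is a minimal hand-rolled Elman forward pass, a sketch under our own assumptions (illustrative layer sizes and random weights, not the paper's trained network):

    % Minimal Elman forward pass: 4 inputs, 10 hidden units, 4 outputs.
    nIn = 4; nHid = 10; nOut = 4; Tlen = 50;
    Wxh = 0.1 * randn(nHid, nIn);      % input -> hidden weights
    Wch = 0.1 * randn(nHid, nHid);     % context -> hidden weights (trained)
    Who = 0.1 * randn(nOut, nHid);     % hidden -> output weights
    alpha = 0.5;                       % fixed context self-connection (improved ENN); 0 gives the original ENN
    c = zeros(nHid, 1);                % context units (the network's memory)
    X = randn(nIn, Tlen);              % dummy input sequence
    Y = zeros(nOut, Tlen);
    for t = 1:Tlen
        h = tanh(Wxh * X(:, t) + Wch * c);  % hidden state sees input and context
        c = h + alpha * c;                  % context := copy of hidden (w = 1.0) plus leak
        Y(:, t) = Who * h;                  % linear output layer
    end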

XOR-Encryption
Exclusive-OR encryption works by using the Boolean exclusive-OR (XOR) function. XOR is a binary operator (meaning that it takes two arguments, similar to the addition sign, for example). As its name suggests, it returns true if one, and only one, of its two operands is true. The logical operation exclusive disjunction, also called exclusive or (symbolized XOR or EOR), is a type of logical disjunction on two operands that results in a value of true if exactly one of the operands has a value of true. A simple way to state this is "one or the other, but not both." [18]
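A minimal MATLAB sketch of XOR encryption on bytes (the key and message here are our own examples) shows the property that makes it usable as a cipher: applying the same key twice recovers the plaintext, since x XOR k XOR k = x.

    % XOR encryption/decryption demo on uint8 data.
    msg = uint8('secret weights');           % data to protect
    key = uint8(repmat([37 142 91], 1, ceil(numel(msg) / 3)));
    key = key(1:numel(msg));                 % repeat the key to the message length
    enc = bitxor(msg, key);                  % encrypt
    dec = bitxor(enc, key);                  % decrypt
    isequal(dec, msg)                        % returns 1 (true)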

8. Suggested Method
In this research a wave-type audio signal was used. The function of the linear predictive coding encoder is to break the sound signal into segments and then send information about each segment to the decoder: the encoder sends information on whether the segment is voiced or unvoiced, together with the pitch period for voiced segments, which is used to create an excitation signal in the decoder. Four matrices (coefficients, voiced, pitch, and gain) were fed into the Elman neural network to train it and to obtain the values of the hidden nodes and the weight matrix. To ensure additional confidentiality, the hidden nodes and the weight matrix were encrypted using the XOR encoder, and finally the output was sent. Figure (7) shows the flowchart of the encryption algorithm. On the receiving side, the steps are reversed: the hidden nodes are multiplied by the weights to recover the gain, voiced, pitch, and coefficient vectors; the pitch vector is decoded with the run-length algorithm; and the gain, voiced, pitch, and coefficient vectors are decoded using the LPC method.
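The encryption of the network parameters can be sketched as follows (a minimal example under our own assumptions: a stand-in 10 x 13 weight matrix, serialization via typecast, and an illustrative key stream; the paper does not specify these details):

    % Sketch: XOR-encrypt a weight matrix after dividing it into two blocks.
    W = randn(10, 13);                                 % stand-in for trained Elman weights
    bytes = typecast(W(:)', 'uint8');                  % serialize doubles to bytes
    half = floor(numel(bytes) / 2);
    b1 = bytes(1:half); b2 = bytes(half+1:end);        % divide the result into two blocks
    key = uint8(mod(1:numel(bytes), 256));             % illustrative key stream
    c = [bitxor(b1, key(1:half)), bitxor(b2, key(half+1:end))];   % encrypt both blocks
    W_rec = reshape(typecast(bitxor(c, key), 'double'), 10, 13);  % decrypt and restore
    isequal(W_rec, W)                                  % returns 1 (true)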

9. Experimental Results
The program of the suggested method was tested on male and female speech files recorded using a microphone on a PC. The first speech signal (male) was sampled at 8000 samples/second and quantized at 8 bits/sample, giving approximately 8 seconds of speech; Figure (9) shows a sample of the original and recovered speech message. The second speech signal (female) was sampled at 8000 samples/second and quantized at 8 bits/sample, approximately 3 seconds of speech; Figure (10) shows a sample of the original and recovered speech message. The third speech signal (female) was sampled at 16000 samples/second and quantized at 16 bits/sample, approximately 3.01 seconds of speech; Figure (11) shows a sample of the original and recovered speech message. The quality of the recovered speech signals (male, female) is measured using SNR and PSNR, as shown in Table (3).
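The quality measures can be computed as in the following sketch (our own formulas: PSNR here uses the peak of the original signal and NRMSE is normalized by the dynamic range; the paper may define the normalizations differently):

    % SNR, PSNR, and NRMSE between original s and recovered s_hat of equal length.
    s     = randn(8000, 1);                 % stand-in for the original speech
    s_hat = s + 0.01 * randn(8000, 1);      % stand-in for the recovered speech
    err   = s - s_hat;
    SNR   = 10 * log10(sum(s.^2) / sum(err.^2));
    PSNR  = 10 * log10(max(abs(s))^2 / mean(err.^2));
    NRMSE = sqrt(mean(err.^2)) / (max(s) - min(s));
    fprintf('SNR = %.2f dB, PSNR = %.2f dB, NRMSE = %.4f\n', SNR, PSNR, NRMSE);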




Table (3): Results of applying the performance measures

10. Conclusions
1. Linear Predictive Coding achieves a high compression rate by coding each 64000 bits/second down to a bit rate of 2400 bits/second.
2. By using Linear Predictive Coding we achieve a coding level and security, because we send the bits of the human production of sound instead of transmitting an estimate of the sound wave.
3. Using the facilities of the Elman neural network, we increase the security by using 10 nodes in the hidden layer; finally, we used XOR encoding after dividing the result into two blocks.
4. After executing the above methods, it is concluded that the method performs better for the encrypted male speech signal, as is clear from Table (3), because women tend to have a high pitch (fast vibration) while males tend to have a low pitch (slow vibration).