how to mask specific bits in a signed fixed point number?

I have been trying to emulate a simple multiplier with fixed-point inputs and output. I would like to mask the last 4 bits of the inputs and test the output. I tried using the bitand() function, but it only accepts integer values. What can I do in the case of fixed-point decimal values?
For example:
a = -2.345 and b = 0.2755 (with 16-bit fixed-point quantization)
c = a * b; (output also quantized to 16 bits)
I want to mask the last 4 bits of a and b and observe the output c. What function should I use?
Thanks!

Accepted Answer

Andy Bartlett on 11 Apr 2022
Edited: Andy Bartlett on 11 Apr 2022
Full-precision multiply
A full-precision multiplication of the 16-bit inputs can be done as follows:
format compact
format long
fp = fipref;
fp.NumericTypeDisplay = 'short';
% Set a and b using signed 16 bits
% assuming a and b are constants
% set best-precision scaling based on the values
%
isSigned = 1;
nBits = 16;
a = fi(-2.3451220703125,isSigned,nBits)
disp(a.bin)
b = fi( 0.2763544921875,isSigned,nBits)
disp(b.bin)
% Full precision multiply
yFullPrecProduct = a .* b
disp(yFullPrecProduct.bin)
which outputs
a =
-2.345092773437500
numerictype(1,16,13)
1011010011110101
b =
0.276351928710938
numerictype(1,16,16)
0100011010111111
yFullPrecProduct =
-0.648070910945535
numerictype(1,32,29)
11101011010000110000000011001011
Notice that the output of 16 bits times 16 bits is 32 bits.
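At the stored-integer level, the same full-precision multiply can be sketched in C. This is only an illustration of the arithmetic, not a description of how fi is implemented. The function name is hypothetical, and the stored integers used in the note below are read off the binary strings printed above.

```c
#include <stdint.h>

/* Hypothetical helper: multiply two signed 16-bit stored integers and
 * return the exact 32-bit stored integer of the product.
 * The product's fraction length is the sum of the input fraction lengths. */
int32_t full_precision_product(int16_t a_stored, int16_t b_stored)
{
    /* Widen before multiplying so the 32-bit result cannot overflow. */
    return (int32_t)a_stored * (int32_t)b_stored;
}
```

With a_stored = -19211 (a at fraction length 13) and b_stored = 18111 (b at fraction length 16), the result is -347930421, which is the bit pattern 0xEB4300CB shown above, interpreted at fraction length 13 + 16 = 29.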
Reduced precision output from multiply
Reducing the size of a fixed-point multiplication's output raises a critical question: which bits to keep in the reduced-precision output and which bits to discard.
Depending on the answer, there will also be questions about how to handle overflow, rounding, or both. If the full-precision output is signed, you may also need to decide whether the reduced-precision output should remain signed or change to unsigned.
Here is an example of keeping the most significant 16 bits:
ntc = numerictype(yFullPrecProduct);
nBitsY1 = 16;
nPrecisionBitsToDrop = 16;
fmSatFloor = fimath('RoundingMethod', 'Floor', ...
'OverflowAction', 'Saturate');
nty1 = numerictype( ...
ntc.SignednessBool,...
nBitsY1, ...
ntc.FractionLength - nPrecisionBitsToDrop);
y1 = fi(yFullPrecProduct,nty1,fmSatFloor);
y1 = removefimath(y1)
disp(y1.bin)
which outputs
y1 =
-0.648071289062500
numerictype(1,16,13)
1110101101000011
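On the stored integer, keeping the most significant 16 bits with Floor rounding amounts to a floor division by 2^16. A minimal C sketch under that reading follows; the helper name is hypothetical, and the division form is used so the code does not depend on how a compiler shifts negative values.

```c
#include <stdint.h>

/* Hypothetical helper: drop the 16 least significant bits of a 32-bit
 * stored integer, rounding toward negative infinity (floor).
 * The result always fits in 16 bits, so no overflow handling is needed. */
int16_t keep_ms16_floor(int32_t product)
{
    int64_t p = product;  /* widen so that negating INT32_MIN is safe */
    /* floor(p / 2^16): for negative p this is -ceil(-p / 2^16). */
    int64_t q = (p >= 0) ? (p / 65536) : -((-p + 65535) / 65536);
    return (int16_t)q;
}
```

Applied to the product's stored integer -347930421, this returns -5309, which is the pattern 1110101101000011 shown for y1. The fraction length drops from 29 to 13.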
Here is an example keeping the least significant 16 bits
ntc = numerictype(yFullPrecProduct);
nBitsY2 = 16;
nPrecisionBitsToDrop2 = 0;
fmSatFloor = fimath('RoundingMethod', 'Floor', ...
'OverflowAction', 'Saturate');
fmWrapFloor = fimath('RoundingMethod', 'Floor', ...
'OverflowAction', 'Wrap');
nty2 = numerictype( ...
ntc.SignednessBool,...
nBitsY2, ...
ntc.FractionLength - nPrecisionBitsToDrop2);
y2sat = fi(yFullPrecProduct,nty2,fmSatFloor);
y2sat = removefimath(y2sat)
disp(y2sat.bin)
y2wrap = fi(yFullPrecProduct,nty2,fmWrapFloor);
y2wrap = removefimath(y2wrap)
disp(y2wrap.bin)
which outputs
y2sat =
-6.103515625000000e-05
numerictype(1,16,29)
1000000000000000
y2wrap =
3.781169652938843e-07
numerictype(1,16,29)
0000000011001011
Notice that two different outputs were computed.
One handles overflow by saturating. In this case, it saturated to the most negative representable value of the final output type.
The other handles overflow by wrapping, which means discarding the most significant bits and keeping the least significant bits verbatim.
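The difference between the two overflow actions can be sketched in C on the stored integer. Both function names are hypothetical, and this only illustrates the arithmetic; it is not the code that MATLAB or its code generator produces.

```c
#include <stdint.h>

/* Saturate: clamp the value to the signed 16-bit range. */
int16_t narrow_saturate(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Wrap: keep the low 16 bits verbatim and reinterpret them as signed. */
int16_t narrow_wrap(int32_t x)
{
    uint16_t low = (uint16_t)x;  /* reduction modulo 2^16, well defined in C */
    /* Map the unsigned pattern back to its two's complement value. */
    return (low >= 0x8000u) ? (int16_t)((int32_t)low - 65536) : (int16_t)low;
}
```

For the product's stored integer -347930421, narrow_saturate returns -32768 (pattern 1000000000000000) and narrow_wrap returns 203 (pattern 0000000011001011), matching y2sat and y2wrap above.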
Masking bits
Masking bits to force certain bits to be zero and/or certain bits to be ones can be done in C, MATLAB, and Simulink using bit-wise AND and bit-wise OR. Functions or Simulink blocks for bit set and bit clear can also be used.
In MATLAB, the functions bitand and bitor are available. When using these with fixed-point fi objects, both arguments must have identical types, so that requires a little bit of care.
This function provides an example of using bitand to force the n least significant bits of the input to be zero.
function y = bitClearLSB(u,nBits)
%bitClearLSB clear the n least significant bits of input
%
% Usage:
% y = bitClearLSB(u,nBits)
% Inputs
% u is any fixed-point or integer variable
% nBits a non-negative integer value (defaults to 1)
%
% Copyright 2022 The MathWorks, Inc.
%#codegen
if nargin < 2
nBits = 1;
end
assert(...
numel(nBits)==1 || isequal(size(nBits),size(u)),...
'nBits must be scalar or same size as u.')
assert(...
all((nBits >= 0) & (nBits == floor(nBits)) & isfinite(nBits)),...
'nBits must be a non-negative integer value.')
% Built-in integers will be handled using equivalent fi object
%
u1 = castIntToFi(u);
assert(isfi(u1) && isfixed(u1), 'u must be integer or fixed-point.')
ntu1 = numerictype(u1);
% Create raw bit mask with all ones in bit positions to keep as is
% and all zeros in bit positions to clear
% Example for word length of 4 bits
% nBits rawBitMask
% 0 1111
% 1 1110
% 2 1100
% 3 1000
% 4 0000
%
wl = ntu1.WordLength;
ntRawBits = numerictype(0,wl,0);
rawBitMask = repmat( upperbound(ntRawBits), size(nBits) );
rawBitMask(:) = bitsll(rawBitMask,nBits);
% bitand for fi requires both types to be identical
% including fimath properties
% so reinterpret bitMask
% then set fimath
%
bitMask = reinterpretcast(rawBitMask,ntu1);
bitMask = setfimath(bitMask,fimath(u1));
y1 = bitand(u1,bitMask);
% if built-in integer cast back to that type
%
y = cast(y1,'like',u);
end
Here is an example of applying that to a variable.
format compact
format long
fp = fipref;
fp.NumericTypeDisplay = 'short';
% Set a and b using signed 16 bits
% assuming a and b are constants
% set best-precision scaling based on the values
%
isSigned = 1;
nBits = 16;
b = fi( 0.2763544921875,isSigned,nBits)
disp(b.bin)
% Clear 4 LSBs of b
%
nBitsClear = uint8(4);
b1 = bitClearLSB(b,nBitsClear)
disp(b1.bin)
which outputs
b =
0.276351928710938
numerictype(1,16,16)
0100011010111111
b1 =
0.276123046875000
numerictype(1,16,16)
RoundingMethod: Nearest
OverflowAction: Saturate
ProductMode: FullPrecision
SumMode: FullPrecision
0100011010110000
The generated C code for the bit clearing operation will be simple like the following
void myFunc(int16_T a, unsigned char nBitsClear, int16_T *y1)
{
int16_T tmp_bit_mask;
tmp_bit_mask = 65535 << nBitsClear;
*y1 = a & tmp_bit_mask;
}
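For experimenting outside MATLAB, the masking step can also be written as a standalone C function. The sketch below is a variant of the idea rather than actual generated code: the function name is hypothetical, and the mask is applied on an unsigned copy so that every step has well-defined behavior in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: clear the n_bits least significant bits of a
 * signed 16-bit stored integer. n_bits must be in the range 0..16. */
int16_t clear_lsbs16(int16_t stored, unsigned n_bits)
{
    assert(n_bits <= 16);
    uint16_t raw = (uint16_t)stored;  /* same bit pattern, unsigned */
    /* Ones in the positions to keep, zeros in the positions to clear. */
    uint16_t mask = (n_bits == 16) ? 0u : (uint16_t)(0xFFFFu << n_bits);
    raw &= mask;
    /* Map the unsigned pattern back to its two's complement value. */
    return (raw >= 0x8000u) ? (int16_t)((int32_t)raw - 65536) : (int16_t)raw;
}
```

Clearing 4 bits of b's stored integer 18111 (0x46BF) gives 18096 (0x46B0), the same pattern as b1 above. For a's stored integer -19211 (0xB4F5) the result is -19216 (0xB4F0). Clearing least significant bits of a two's complement value always moves it toward negative infinity, so it behaves like Floor rounding to a coarser precision.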
Hopefully, this example gives you enough of an idea to craft whatever bit masking operation you are seeking.
Then combining that with the multiplication examples above should allow you to figure out a solution to your overall problem.
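One way to put the pieces together, sketched in C on the stored integers: mask both inputs, multiply at full precision, then keep the most significant 16 bits of the product. The function name is hypothetical, and keeping the top 16 bits with Floor rounding is an assumption about what "output quantized to 16 bits" should mean; substitute whichever reduction you actually need.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical end-to-end sketch: clear n_bits (0..15) LSBs of both
 * 16-bit inputs, multiply exactly, and return the top 16 bits of the
 * 32-bit product. int32_t is two's complement by definition, so the
 * bitwise AND on negative values is well defined. */
int16_t masked_multiply(int16_t a_stored, int16_t b_stored, unsigned n_bits)
{
    assert(n_bits <= 15);
    int32_t keep = ~(((int32_t)1 << n_bits) - 1);  /* ones above the cleared bits */
    int32_t a_masked = (int32_t)a_stored & keep;
    int32_t b_masked = (int32_t)b_stored & keep;
    int32_t product = a_masked * b_masked;         /* fits: 16 x 16 -> 32 bits */
    /* Zero the low 16 bits, then divide exactly by 2^16 (floor rounding). */
    return (int16_t)((product & ~(int32_t)0xFFFF) / 65536);
}
```

With the stored integers from the examples above (-19211 and 18111), masking 4 bits gives -5306, compared with -5309 when no bits are masked. At fraction length 13 that is a change of 3/8192 in the output.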
Consider casting
Since your high level goal involved multiplication, bit masking might not be the simplest way to achieve your goal. If your goal is to get rid of a certain number of most significant bits or least significant bits, you might want to consider using casting.
Consider the example given above of keeping the most significant 16 bits of a variable (that happened to be a multiplication product). That dropped the least significant 16 bits of the input. Mathematically, that is equivalent to keeping the output at 32 bits but using masking such that the least significant 16 bits are all zeros.
Downcasting to 16 bits can be easier to think about and model than doing the bit masking. A big benefit is that subsequent operations can be more efficient. For example, bit masking then doing a 32 bit by 32 bit multiplication producing a 64 bit ideal product is less efficient than downcasting to 16 bits, then doing a 16 bit by 16 bit multiplication that produces a 32 bit ideal product.
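The two paths compared in this paragraph can be sketched in C. Both function names are hypothetical and the operand values in the note are arbitrary illustrations. The sketch shows that the wide product of masked operands carries no more information than the narrow product of downcast operands: the two differ only by a factor of 2^32.

```c
#include <stdint.h>

/* Path 1: keep 32-bit operands with their low 16 bits masked to zero,
 * then multiply into a 64-bit product. */
int64_t product_masked32(int32_t x, int32_t y)
{
    int32_t xm = x & ~(int32_t)0xFFFF;
    int32_t ym = y & ~(int32_t)0xFFFF;
    return (int64_t)xm * (int64_t)ym;
}

/* Path 2: downcast each operand to its top 16 bits first (floor),
 * then do a 16-bit by 16-bit multiply into a 32-bit product. */
int32_t product_downcast16(int32_t x, int32_t y)
{
    int16_t xh = (int16_t)((x & ~(int32_t)0xFFFF) / 65536);
    int16_t yh = (int16_t)((y & ~(int32_t)0xFFFF) / 65536);
    return (int32_t)xh * (int32_t)yh;
}
```

For any inputs, product_masked32(x, y) equals product_downcast16(x, y) multiplied by 2^32, so the low 32 bits of the wide product are always zero and the narrower multiply loses nothing.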
3 Comments
Andy Bartlett on 12 Apr 2022
Here is why downcasting to smaller types before the multiplication is better than multiplying in a bigger type where the least significant bits have been set to zero.
Microcontroller
Suppose you are targeting an ARM microcontroller for deployment of your embedded design.
A multiply instruction with a 32-bit output takes 1 or 2 clock cycles.
In contrast, a multiply instruction with a 64-bit output takes 3 to 7 clock cycles.
So the 64 bit multiply is at best 50% slower. Depending on the specific instruction needed it could be 100% to 600% slower.
FPGA
Alternatively, suppose you are targeting an FPGA with a DSP48E math slice.
The DSP48E can do a full precision multiplication up to 25 bits by 18 bits.
So a 16 bit by 16 bit multiplication can fit in just one DSP48E slice and get the full speed advantages of that hardened and optimized circuit.
In contrast, a 32 bit by 32 bit multiplication would not fit in one DSP48E. More FPGA resources would be needed to partition and coordinate the math. The pieces would need to be coordinated across multiple clock cycles, thus being slower too.
ASIC
For an ASIC, the number of transistors needed for an n-bit by n-bit multiply is on the order of n^2. So the 32 bit by 32 bit multiply would require about 4 times as many transistors as the 16 bit by 16 bit multiply.
These digital circuits are really analog underneath. The clock cycle needs to be slow enough for these analog circuits to stabilize to their "digital" high or low voltages before starting the next calculation. The more complicated the combinatorial circuit, the longer the settling time. Think of an addition with carry having to propagate bit-level carries across a 32-bit output vs a 64-bit output. So a slower clock rate is one way to handle the bigger multiply. An alternative is to break up pieces of the calculation and pipeline them. The smaller pieces will have a faster settling time, so the clock can be faster. But the latency will be longer because the full multiply operation must wait multiple clock cycles for the calculation to be fully done.

Release

R2021b