I am reporting a significant performance degradation when multiplying a vector with a 2D tensor column-wise (254 TOPS measured) compared with doing it row-wise (419 TOPS measured). It is an int8 matmul ...
The workaround is to also break the connection to the float before assigning a new vector, then reconnect the float node to the second input on the multiply node. It seems like the type matching test ...