'memcpy'-like function that supports offsets by individual bits?

I was thinking about solving this, but it's looking to be quite a task. If I take this one by myself, I'll likely write it several different ways and pick the best, so I thought I'd ask this question to see if there's a good library that solves this already or if anyone has thoughts/advice.

void OffsetMemCpy(u8* pDest, u8* pSrc, u8 srcBitOffset, size_t size)
{
    // Or something along these lines. srcBitOffset is 0-7, so the pSrc buffer
    // needs to be up to one byte longer than it would need to be in memcpy.
    // Maybe explicitly providing the end of the buffer is best.
    // Also note that pSrc has NO alignment assumptions at all.
}

My application is time critical so I want to nail this with minimal overhead. This is the source of the difficulty/complexity. In my case, the blocks are likely to be quite small, perhaps 4-12 bytes, so big-scale memcpy stuff (e.g. prefetch) isn't that important. The best result would be the one that benches fastest for constant 'size' input, between 4 and 12, for randomly unaligned src buffers.

  • Memory should be moved in word-sized blocks whenever possible.
  • Alignment of these word-sized blocks is important. pSrc is unaligned, so we may need to read a few bytes off the front until it is aligned.

Anyone have, or know of, a similar implemented thing? Or does anyone want to take a stab at writing this, getting it to be as clean and efficient as possible?

Edit: It seems people are voting to close this as "too broad". A few narrowing details: AMD64 is the preferred architecture, so let's assume that. This means little-endian, etc. The implementation would hopefully fit well within the size of an answer, so I don't think this is too broad. I'm asking for answers that are a single implementation at a time, even though there are a few approaches.


I would start with a simple implementation such as this:

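#include <limits.h>   // CHAR_BIT
#include <stddef.h>   // size_t
#include <stdint.h>   // uint8_t

inline void OffsetMemCpy(uint8_t* pDest, const uint8_t* pSrc, const uint8_t srcBitOffset, const size_t size)
{
    if (srcBitOffset == 0) {
        // Byte-aligned case: plain copy, no extra source byte is read.
        for (size_t i = 0; i < size; ++i)
            pDest[i] = pSrc[i];
    } else if (size > 0) {
        // Bits are taken MSB-first; this reads pSrc[0..size], one byte more than memcpy would.
        uint8_t v0 = pSrc[0];
        for (size_t i = 0; i < size; ++i) {
            uint8_t v1 = pSrc[i + 1];
            pDest[i] = (uint8_t)((v0 << srcBitOffset) | (v1 >> (CHAR_BIT - srcBitOffset)));
            v0 = v1;
        }
    }
}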

(warning: untested code!).
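Since the code above is untested, it may help to pin down the intended semantics with a deliberately slow bit-at-a-time version that can serve as an oracle when checking faster variants. This is only a sketch: the names GetBitMsbFirst and OffsetMemCpyBitwise are made up for illustration, and it assumes the same convention as the code above, namely that srcBitOffset counts from the most significant bit of pSrc[0].

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns bit number bitIndex of the source,
 * where bit 0 is the most significant bit of pSrc[0]. */
static unsigned GetBitMsbFirst(const uint8_t *pSrc, size_t bitIndex)
{
    return (pSrc[bitIndex / 8] >> (7 - bitIndex % 8)) & 1u;
}

/* Hypothetical oracle: assembles every destination byte one bit at a time.
 * Like the byte-wise version, it reads pSrc[size] only when srcBitOffset != 0. */
void OffsetMemCpyBitwise(uint8_t *pDest, const uint8_t *pSrc,
                         uint8_t srcBitOffset, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        unsigned byte = 0;
        for (size_t b = 0; b < 8; ++b)
            byte = (byte << 1) | GetBitMsbFirst(pSrc, i * 8 + srcBitOffset + b);
        pDest[i] = (uint8_t)byte;
    }
}
```

As a worked example under that convention: with source bytes 0xAB 0xCD 0xEF, srcBitOffset 4 and size 2, the destination becomes 0xBC 0xDE, i.e. the copy starts at the low nibble of the first byte.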

Once this is working then profile it in your application - you may find it's plenty fast enough for your needs and thereby avoid the pitfalls of premature optimisation. If not then you have a useful baseline reference implementation for further optimisation work.
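If you want rough numbers before profiling the real application, something like the following could serve as a starting point. It is a sketch, not a rigorous benchmark: BenchCopy, CopyFn and OffsetMemCpyBytes are names invented here, clock() from <time.h> is coarse, and calling through a function pointer prevents inlining, which matters for copies this small. It assumes size is between 1 and 12.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature shared by all candidate implementations. */
typedef void (*CopyFn)(uint8_t *, const uint8_t *, uint8_t, size_t);

/* Byte-wise baseline using the same MSB-first logic as the code above. */
static void OffsetMemCpyBytes(uint8_t *pDest, const uint8_t *pSrc,
                              uint8_t srcBitOffset, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        unsigned v = (unsigned)pSrc[i] << srcBitOffset;
        if (srcBitOffset != 0)
            v |= (unsigned)pSrc[i + 1] >> (CHAR_BIT - srcBitOffset);
        pDest[i] = (uint8_t)v;
    }
}

/* Times `iterations` copies of `size` bytes (1..12) and returns CPU seconds.
 * The byte and bit offsets of the source vary per iteration so the source is
 * not consistently aligned. *checksum receives a value derived from the
 * output so the compiler cannot discard the copies. */
double BenchCopy(CopyFn fn, size_t size, unsigned long iterations,
                 unsigned *checksum)
{
    uint8_t src[32], dst[16];
    for (size_t i = 0; i < sizeof src; ++i)
        src[i] = (uint8_t)(i * 37u + 11u);

    unsigned sum = 0;
    clock_t begin = clock();
    for (unsigned long n = 0; n < iterations; ++n) {
        size_t byteOffset = n % 8;                  /* 0..7 */
        uint8_t bitOffset = (uint8_t)((n / 8) % 8); /* 0..7 */
        fn(dst, src + byteOffset, bitOffset, size);
        sum += dst[size - 1];
    }
    clock_t end = clock();

    *checksum = sum;
    return (double)(end - begin) / CLOCKS_PER_SEC;
}
```

Calling BenchCopy(OffsetMemCpyBytes, 8, 100000000UL, &sum) and then the same with a candidate replacement gives two times to compare; the two checksums should match if the candidate is correct.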

Be aware that for small copies the overhead of testing for alignment and word-sized copies etc may well outweigh any benefits, so a simple byte by byte loop such as the above may well be close to optimal.
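For comparison, here is a rough sketch of what a word-at-a-time variant might look like, so you can measure it against the byte loop rather than guess. The names load_be64, store_be64 and OffsetMemCpyWords are invented for this example. It keeps the MSB-first convention, handles 8 output bytes per step, and falls back to the byte loop for the tail. It does not try to align the source; the helpers are written byte-wise so they are valid for any alignment and endianness, and it is worth checking the generated assembly to see whether your compiler reduces them to a single load or store plus a byte swap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 8 bytes as one big-endian 64-bit value. */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i)
        v = (v << 8) | p[i];
    return v;
}

/* Hypothetical helper: write a 64-bit value as 8 big-endian bytes. */
static void store_be64(uint8_t *p, uint64_t v)
{
    for (int i = 7; i >= 0; --i) {
        p[i] = (uint8_t)v;
        v >>= 8;
    }
}

/* Copies `size` bytes starting srcBitOffset (0..7) bits into pSrc[0].
 * Reads pSrc[size] only when srcBitOffset != 0. */
void OffsetMemCpyWords(uint8_t *pDest, const uint8_t *pSrc,
                       uint8_t srcBitOffset, size_t size)
{
    if (srcBitOffset == 0) {
        memcpy(pDest, pSrc, size);
        return;
    }

    const unsigned k = srcBitOffset; /* 1..7 from here on */
    size_t i = 0;

    /* 8 output bytes per step: 8 source bytes shifted left by k bits,
     * with the top k bits of the ninth source byte filling the gap. */
    for (; i + 8 <= size; i += 8) {
        uint64_t w = load_be64(pSrc + i);
        uint64_t out = (w << k) | (uint64_t)(pSrc[i + 8] >> (8 - k));
        store_be64(pDest + i, out);
    }

    /* Tail of fewer than 8 bytes: plain byte loop. */
    for (; i < size; ++i)
        pDest[i] = (uint8_t)((pSrc[i] << k) | (pSrc[i + 1] >> (8 - k)));
}
```

For the 4-12 byte sizes in the question this does at most one 64-bit step, so whether it beats the simple loop is something only a measurement on the target CPU can tell.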

Note also that optimisations may well be architecture-dependent - micro-optimisations which give a benefit on one CPU may well be counter-productive on another.
