Keith Smiley | About | RSS

Reproducible codesigning on Apple Silicon

For people who expect reproducible builds, Apple Silicon machines provide an interesting challenge. Apple Silicon requires arm64 binaries, including command line tools you build yourself, be codesigned. This change is mostly transparent to developers, because Apple updated their linker to automatically ad-hoc sign binaries1. Unfortunately, if you're interested in producing binaries that support both Intel Macs and Apple Silicon Macs, you likely want to produce a fat binary. When codesigning this binary you hit some behavior that depends on your current machine's architecture.

Example

You can consistently produce the same result across multiple machines when compiling a binary without signing it. Here's an example with a simple C program:

$ echo "int main() { return 0; }" > main.c
$ clang main.c -Wl,-no_adhoc_codesign -arch arm64 -arch x86_64 -o main
$ shasum main
113033b3d9a247210b49a476bbfadb2e347846fe  main

The shasum of main should always be the same regardless of your host machine2. On Apple Silicon machines you can see this binary has the same sha1 even if you run clang under Rosetta 23:

$ arch -x86_64 clang main.c -Wl,-no_adhoc_codesign -arch arm64 -arch x86_64 -o main
$ shasum main
113033b3d9a247210b49a476bbfadb2e347846fe  main

The issue is introduced when you codesign the binary on Apple Silicon machines versus Intel machines. You can immediately see the difference3:

$ codesign --force --sign - main
$ shasum main
84631e812bd480c306766ba03a728dd2565dd672  main
% arch -x86_64 codesign --force --sign - main
% shasum main
f631b6c0daf3ffd0bb5f65d19fa045acf447a72d  main

We get closer to identifying the problem when you compare the details of these differences:

$ codesign --force --sign - main
$ codesign -dvvv main > arm.txt 2>&1
$ arch -x86_64 codesign --force --sign - main
$ codesign -dvvv main > intel.txt 2>&1
$ diff -Nur intel.txt arm.txt
--- intel.txt   2021-10-05 21:26:32.731918710 -0700
+++ arm.txt     2021-10-05 21:26:29.473702845 -0700
@@ -1,14 +1,14 @@
 Executable=/private/tmp/main
-Identifier=main-555549445413bc88d3b13dfa855a7f47cf229020
+Identifier=main-55554944e4472179d9ec3bff92ff3f8e6d013184
 Format=Mach-O universal (x86_64 arm64)
 CodeDirectory v=20400 size=358 flags=0x2(adhoc) hashes=5+2 location=embedded
 Hash type=sha256 size=32
-CandidateCDHash sha256=78fab81d4fef72ae942e3272ba2c9085e1893828
-CandidateCDHashFull sha256=78fab81d4fef72ae942e3272ba2c9085e1893828a77095ae517ab7d3b8229ad0
+CandidateCDHash sha256=1a9fcc4eabde35d87326380f5eca6672a1c96f78
+CandidateCDHashFull sha256=1a9fcc4eabde35d87326380f5eca6672a1c96f78f252bbecec7f19ffdd56e420
 Hash choices=sha256
-CMSDigest=78fab81d4fef72ae942e3272ba2c9085e1893828a77095ae517ab7d3b8229ad0
+CMSDigest=1a9fcc4eabde35d87326380f5eca6672a1c96f78f252bbecec7f19ffdd56e420
 CMSDigestType=2
-CDHash=78fab81d4fef72ae942e3272ba2c9085e1893828
+CDHash=1a9fcc4eabde35d87326380f5eca6672a1c96f78
 Signature=adhoc
 Info.plist=not bound
 TeamIdentifier=not set

There are a few fields that differ here, but, given that most of them seem to be hashes, it felt natural to focus on the Identifier field (especially since that appears to contain our file name).

In the manual page for codesign(1), we see some useful information related to identifier:

-i, --identifier identifier
    During signing, explicitly specify the unique identifier string that is
    embedded in code signatures. If this option is omitted, the identifier
    is derived from either the Info.plist (if present), or the filename of
    the executable being signed, possibly modified by the --prefix option.
    It is a very bad idea to sign different programs with the same
    identifier.

This gives us further hints to what is going on. Specifically, since we do not have an Info.plist, the logic must not be deriving the identifier in that way. Given this information we can assume codesign is falling back to some other value here, but it's still not clear why it isn't reproducible across architectures.

Luckily, Apple's open source page has quite a few internal libraries, which in this case covers our issue. Looking through the Security project's source code4, we can immediately see some useful information for this field:

@constant kSecCodeSignerIdentifier If present, a CFString that explicitly specifies
 the unique identifier string sealed into the code signature. If absent, the identifier
 is derived implicitly from the code being signed.

This begs the question: How is this information derived if there is no explicit identifier? Tracing this constant through the code, we can see it sets the mIdentifier field, which is otherwise only set through this logic:

identifier = rep->recommendedIdentifier(*this);
if (identifier.find('.') == string::npos)
  identifier = state.mIdentifierPrefix + identifier;
if (identifier.find('.') == string::npos && isAdhoc())
  identifier = identifier + "-" + uniqueName();

The prefix referenced here is from the --prefix field mentioned in the codesign(1) man page. Since we're also not passing that we can ignore this logic and assume the uniqueName() logic is key here (which also explains the - we see after our filename). As we trace this logic through the codebase, we begin to see the core issue:

//
// Generate a unique string from our underlying DiskRep.
// We could get 90%+ of the uniquing benefit by just generating
// a random string here. Instead, we pick the (hex string encoding of)
// the source rep's unique identifier blob. For universal binaries,
// this is the canonical local architecture, which is a bit arbitrary.
// This provides us with a consistent unique string for all architectures
// of a fat binary, *and* (unlike a random string) is reproducible
// for identical inputs, even upon resigning.
//
std::string SecCodeSigner::Signer::uniqueName() const {
  CFRef<CFDataRef> identification = rep->identification();
  ...
}

//
// We choose the binary identifier for a Mach-O binary as follows:
//  - If the Mach-O headers have a UUID command, use the UUID.
//  - Otherwise, use the SHA-1 hash of the (entire) load commands.
//
CFDataRef MachORep::identification() {
  std::unique_ptr<MachO> macho(mainExecutableImage()->architecture());
  return identificationFor(macho.get());
}

CFDataRef MachORep::identificationFor(MachO *macho) {
  // if there is a LC_UUID load command, use the UUID contained therein
  if (const load_command *cmd = macho->findCommand(LC_UUID)) {
    const uuid_command *uuidc = reinterpret_cast<const uuid_command *>(cmd);
    // uuidc->cmdsize should be sizeof(uuid_command), so if it is not,
    // something is wrong. Fail out.
    if (macho->flip(uuidc->cmdsize) != sizeof(uuid_command))
      MacOSError::throwMe(errSecCSSignatureInvalid);
    char result[4 + sizeof(uuidc->uuid)];
    memcpy(result, "UUID", 4);
    memcpy(result+4, uuidc->uuid, sizeof(uuidc->uuid));
    return makeCFData(result, sizeof(result));
  }

  // otherwise, use the SHA-1 hash of the entire load command area (this is way, way obsolete)
  ...
}

The gist of this logic is to fetch the UUID embedded in every binary and use that to derive the identifier. The reason this isn't reproducible across architectures is because the UUID is based on the content of each binary, which differs across architectures. You can print these UUIDs with:

$ dwarfdump -u main
UUID: 5413BC88-D3B1-3DFA-855A-7F47CF229020 (x86_64) main
UUID: E4472179-D9EC-3BFF-92FF-3F8E6D013184 (arm64) main

The fallback logic we see above relies on the contents of binary's load commands which unfortunately also differs based on architecture.

Summary

While this was a very informative deep dive into this logic, if you rely on reproducible binaries and want to support Apple Silicon machines, you need to do 2 things for binaries without Info.plist files:

  1. Don't allow the linker to automatically sign your binaries by passing -no_adhoc_codesign
  2. Pass an explicit identifier when linking binaries with --identifier to the codesign invocation

I filed a radar about this behavior: FB9681559 (Make codesign for fat binaries reproducible across architectures).

Updates

  • After working around this codesign issue it was noted that the same issue with random binary names also affects the UUID computation. To workaround this you can pass -Wl,-no_uuid but note this has a negative impact on LLDB attach times for binaries if you need to debug them.
  • I landed a fix in clang to make these temporary binary names reproducible. It will hopefully be part of a future version of Apple's clang fork.
  1. This behavior can be disabled by passing -no_adhoc_codesign to link invocations. With OTHER_LDFLAGS in Xcode you need to pass -Wl,-no_adhoc_codesign since link invocations go through clang, not directly to ld 

  2. As long as you're using the same version of clang. This example was built with Xcode 13.0 13A233 

  3. Luckily if you run this on an Intel machine, instead of through arch's Intel emulation, you'll get the same results  2

  4. I was looking at 59754.140.13