Context
I ran into an unusual OSLogin failure while migrating a long-running server into a new GCP project. The request was to move everything out of one old GCP project into multiple new ones, one per environment, which included several VMs with separate dev and prod versions.
The migration strategy was to snapshot the original compute instance, launch a new instance with the same configuration using the snapshot as the boot disk, and then point DNS at the new server. All of this was done via scripts and Terraform.
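For context, that flow looks roughly like this with gcloud; the project, zone, and resource names below are placeholders, and the real work was driven by scripts and Terraform rather than commands run by hand:

```
# Snapshot the boot disk of the original instance in the old project.
gcloud compute disks snapshot prod-vm --project old-project --zone us-central1-a \
    --snapshot-names prod-vm-boot-snap

# Create a new boot disk in the target project from that snapshot.
gcloud compute disks create prod-vm-boot --project new-prod-project --zone us-central1-a \
    --source-snapshot projects/old-project/global/snapshots/prod-vm-boot-snap

# Launch the replacement instance on the restored boot disk, then repoint DNS.
gcloud compute instances create prod-vm --project new-prod-project --zone us-central1-a \
    --disk name=prod-vm-boot,boot=yes
```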
For one application, the developers requested that both the dev and prod VMs be created from the production machine’s snapshot because production had newer code. I deployed the development compute instance using the production snapshot and wasn’t able to log in using GCP’s OSLogin.
The Problem
At first, this looked like a normal OSLogin or IAM issue. The symptoms matched:
- SSH attempts failed
- OSLogin was enabled
- IAM permissions looked correct
- A fresh VM in the same project worked fine with OSLogin
Taken together, this didn’t fit any of the normal failure patterns.
Verbose SSH output on the client only showed the usual public key denial. Using OSLogin to SSH into the original production VM still worked. The smoking gun appeared in the sshd logs:
error: Unsafe AuthorizedKeysCommand "/usr/bin/google_authorized_keys": bad ownership or modes for directory /usr/bin
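For anyone chasing a similar failure, the evidence lives in two places: verbose output on the client, and sshd’s own log on the server (reachable here only through some other path, such as the serial console). Where the server log lands depends on the image; the hostnames below are placeholders:

```
# Client side: -vvv shows the publickey offers being rejected.
ssh -vvv user@dev-vm.example.com

# Server side: look for the "Unsafe AuthorizedKeysCommand" error.
sudo journalctl -u ssh --since "1 hour ago"            # Debian/Ubuntu (ssh.service)
sudo journalctl -u sshd --since "1 hour ago"           # RHEL/CentOS (sshd.service)
sudo grep -i authorizedkeyscommand /var/log/auth.log   # Debian-family file log
```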
Why OSLogin Was Failing
OSLogin works through a helper binary, /usr/bin/google_authorized_keys, which sshd calls to fetch a user’s authorized keys from GCP instead of reading a local authorized_keys file. You may notice gcloud uploading a generated key to your OSLogin profile the first time you run gcloud compute ssh.
OpenSSH enforces strict security rules on its configuration and on any AuthorizedKeysCommand helper it is asked to run. Every directory in the path to the helper must:
- Be owned by root
- Not be group-writable
- Not be world-writable
If any directory violates those rules, OpenSSH refuses to execute the helper at all.
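The hook itself is just a pair of sshd_config directives that the GCE guest packages add when OSLogin is turned on; checking for them confirms the helper is actually wired in. The exact file can vary by image, so treat this as a sketch rather than a guaranteed location:

```
# Show how sshd is told to call the OSLogin helper.
grep -riE '^AuthorizedKeysCommand' /etc/ssh/sshd_config /etc/ssh/sshd_config.d/ 2>/dev/null
# Typically prints something like:
#   AuthorizedKeysCommand /usr/bin/google_authorized_keys
#   AuthorizedKeysCommandUser root
```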
In this case, OSLogin stopped working, and we would never have known why without being able to log into the original machine.
Checking the filesystem permissions on the new instances showed:
drwxrwxrwx 777 root:root /usr
drwxrwxrwx 777 root:root /usr/bin
-rwxr-xr-x 755 root:root /usr/bin/google_authorized_keys
The binary itself was fine, but the directories above it were not.
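A quick way to run the same check on any box is to walk every component of the helper’s path; namei from util-linux does it in one shot, and stat reproduces the listing above:

```
# Owner and mode for each directory leading to the helper binary.
namei -l /usr/bin/google_authorized_keys

# The same information via stat.
stat -c '%A %a %U:%G %n' / /usr /usr/bin /usr/bin/google_authorized_keys
```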
It looks like a previous developer had run chmod 777 on /usr at some point long ago. I also discovered that the system’s Python directory was owned by a regular, non-root user.
Why the Original Server Still Worked
The source machine had been running continuously for over 800 days and had never been rebooted after those directory permissions were changed.
The server had also been snapshotted live, which wasn’t the usual process; these servers are normally suspended before a snapshot is taken.
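The usual flow, for comparison, looks something like this (instance and project names are placeholders):

```
# Quiesce the VM, snapshot its boot disk, then bring it back.
gcloud compute instances suspend prod-vm --project old-project --zone us-central1-a
gcloud compute disks snapshot prod-vm --project old-project --zone us-central1-a \
    --snapshot-names prod-vm-boot-snap
gcloud compute instances resume prod-vm --project old-project --zone us-central1-a
```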
If this server had been patched and rebooted at any point after those changes were made, it likely would have become inaccessible. I wasn’t able to find a functioning SSH key for the server’s root user.
The Fix
Correcting the directory ownership and permissions immediately resolved the problem.
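The exact commands weren’t captured, but the fix amounts to restoring standard ownership and modes on the offending directories; a minimal sketch, assuming /usr and /usr/bin were the only ones affected:

```
# Put the directories back to root-owned and not group/world-writable.
sudo chown root:root /usr /usr/bin
sudo chmod 755 /usr /usr/bin

# Then simply retry the login (placeholder instance/project names).
gcloud compute ssh dev-vm --project new-dev-project --zone us-central1-a
```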
After that, OSLogin worked normally on the affected instances.
Why This Was Hard to Track Down
This wasn’t a typical OSLogin failure mode.
Most issues in this area are:
- IAM permissions
- Missing service account roles
- Metadata settings
- Guest agent problems
In this case, all of those looked fine.
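For reference, these are the kinds of checks that ruled the usual suspects out; the names are placeholders and the exact flags may vary slightly with your gcloud version:

```
# Is OSLogin enabled via instance or project metadata?
gcloud compute instances describe dev-vm --project new-dev-project \
    --zone us-central1-a --format='yaml(metadata.items)'
gcloud compute project-info describe --project new-dev-project \
    --format='yaml(commonInstanceMetadata.items)'

# Who holds the OSLogin role in the new project?
gcloud projects get-iam-policy new-dev-project \
    --flatten='bindings[].members' \
    --filter='bindings.role:roles/compute.osLogin' \
    --format='table(bindings.role, bindings.members)'

# Is the guest agent running on the VM?
sudo systemctl status google-guest-agent
```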
The actual problem was low-level filesystem permission drift that had been captured in a snapshot and replicated into multiple environments.
Because the original server had stayed up for so long, the problem never surfaced there.
Lessons Learned
A few practical takeaways from this:
- Snapshotting inherited state can carry hidden drift
- Long-running servers can mask problems that only appear after a reboot or migration
It’s a reminder that cloning production machines isn’t the same as building from a clean base image. Sometimes you inherit more than just the data.
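One way to catch this class of drift before it gets baked into a snapshot is a quick audit of the source machine; a minimal sketch that flags exactly the conditions sshd’s safety check rejects:

```
# Directories under /usr and /etc that are group/world-writable or not
# owned by root -- the things OpenSSH will refuse to trust.
sudo find /usr /etc -xdev -type d \( -perm /022 -o ! -user root \) -exec ls -ld {} +
```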