Back in January 2020, I interviewed for a Senior Site Reliability Engineer position at a company in Austin. They had a pretty cool process for testing their candidates’ troubleshooting skills. I had a ton of fun doing it, so I’m sharing it here in the hopes that other companies adopt the idea.
I’ve run into many of these situations in the past as a systems administrator. It’s 3am. People are yelling at me on the phone. I can’t think straight. What in the world did my co-workers do to this box? Why would they do these things?
The basic idea is that they spin up a temporary virtual machine with Confluence installed on it. You are expected to ssh in, start Confluence, navigate to a certain page within Confluence and tell them what’s on it. The only problem is, the instance is totally jacked up. It’s broken in multiple ways that would confound any normal person. They want details on how you approached your problem solving.
I was given an IP address, a login name, and a private SSH key. They said I’d have 60 minutes to fix everything and figure it out. The timer starts on the first ssh connection.
A few relevant facts were given:
- The database username.
- The application install directory.
- The application’s home directory.
- The Tomcat log directory.
- The Confluence log directory.
- The URL to the webapp’s status page.
I saved my command history for my writeup to the company. I’ll include some of those below to give an idea of my process. It’s kind of neat seeing someone poking around a system they’re unfamiliar with.
The first step, ssh’ing in:
chmod 600 Downloads/id_rsa
ssh -i Downloads/id_rsa lab@IP-ADDRESS
I took the private SSH key they emailed me, fixed its permissions, and used it to SSH into the VM.
In a way, this alone is a good test. A senior should be able to ssh into a machine with a private key. It’s basic, but there are candidates out there who would be unable to do this.
Once I was logged in, I checked disk space and memory. There was nothing out of the ordinary there.
df -h
free -g
htop
top
I tried starting Confluence by running the startup script (given in my original email). Of course, it fails to start.
Problem #1: “/etc/init.d/confluence” script had incorrect directory reference (/zopt instead of /opt)
Upon opening the script in vi, I see that a directory being used is obviously wrong (/zopt instead of /opt).
/etc/init.d/confluence start
ll /opt/atlassian/confluence/bin
set -o vi
vi /etc/init.d/confluence
/etc/init.d/confluence start
I also realized I’m supposed to be using sudo to start the application. And, just like they said in their email, there was only a limited set of commands sudo was allowed to run.
sudo -l
sudo /etc/init.d/confluence start
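I found the bad path by reading the script in vi, but the same check can be sketched in a few lines of bash. This is only an illustration: the helper name `missing_top_dirs`, the variable `CONF_HOME`, and the contents of the stand-in script are all made up, since I don’t have the real init script anymore. It pulls the top-level directory out of every absolute path in a script and reports the ones that don’t exist.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print top-level directories referenced by a script
# (e.g. /opt, /zopt) that do not exist on this machine.
missing_top_dirs() {
  local script="$1" top
  # Match "/name" only when it starts a path (after whitespace, a quote, = or :),
  # then strip the leading delimiter so only "/name" remains.
  grep -oE '(^|[[:space:]"=:])/[A-Za-z0-9_-]+' "$script" |
    sed -E 's/^[^/]*//' |
    sort -u |
    while read -r top; do
      [ -d "$top" ] || echo "$top"
    done
}

# Made-up stand-in for /etc/init.d/confluence, just for the demo.
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#!/bin/bash
CONF_HOME="/zopt/atlassian/confluence"
cd "$CONF_HOME/bin" && ./startup.sh
EOF

missing_top_dirs "$demo_script"
```

On the real box, the call would have been something like `missing_top_dirs /etc/init.d/confluence`. It’s a heuristic, so it only narrows down where to look.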
Problem #2: JAVA_HOME referencing incorrect directory (/zopt instead of /opt)
Now, the script is complaining that it can’t find Java. The JAVA_HOME environment variable had been set incorrectly. It was looking in a path that looked reasonable, but in fact was wrong.
It’s been over a year since I completed this. I don’t remember exactly why I couldn’t fix JAVA_HOME by editing /etc/init.d/confluence. But I did have to search around and find a file that I could edit that would actually allow me to affect the environment of the service before it started. That file ended up being “setenv.sh”
Now, I’m logged in as a normal user here, so I don’t have access to edit every file. Many important files are read-only to me. So I actually had to poke around and search, not only to see where things were, but to see what I could actually touch.
psg tomcat
ps -ef | grep tomcat
vi /opt/atlassian/confluence/logs/catalina.out
vi /opt/atlassian/confluence/bin/catalina.sh
vi /opt/atlassian/confluence/logs/catalina.out
vi /opt/atlassian/confluence/bin/catalina.sh
echo $JAVA_HOME
sudo -l
/opt/atlassian/confluence
ll
cd conf
ll
cd ..
cd bin
ll
vi setenv.sh
vi setjre.sh
history
ll
id
sudo /etc/init.d/confluence start
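All that poking around with `ll` was really me answering one question: which files under the install directory can I actually edit? A rough sketch of a faster way, assuming GNU find (the `-writable` test isn’t available everywhere). The function name `writable_files` and the temp directory are just for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list regular files under a directory that the
# current user is allowed to write to. Relies on GNU find's -writable test.
writable_files() {
  find "$1" -type f -writable 2>/dev/null | sort
}

# Demo in a throwaway directory: one editable file, one read-only file.
sandbox=$(mktemp -d)
touch "$sandbox/setenv.sh" "$sandbox/catalina.sh"
chmod 444 "$sandbox/catalina.sh"

writable_files "$sandbox"
```

On the lab VM this would have been `writable_files /opt/atlassian/confluence`. Note that root can write to everything, so the read-only file only drops out of the list when you run this as a normal user, like I was.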
Problem #3: Java heapsize set too low.
I tried starting Confluence again, only to find the error you get when a Java process reaches its max heapsize.
This is like an obstacle course. It’d be trivial if you were just messing around on your own, but a 60 minute timer really messes with your head. Any wasted time will get you in trouble later. If you go down the wrong path, even for 5 minutes, it could cost you.
At any rate, I edited the Java “-Xms” and “-Xmx” arguments in the setenv.sh file. I noticed the VM had 3g of ram, so I set the heapsize to 2g.
ps -ef | grep tomcat
ps -a
top
history
vi /opt/atlassian/confluence/logs/catalina.out
free -g
free -m
vi setenv.sh
vi /opt/atlassian/confluence/logs/catalina.out
vi setenv.sh
sudo /etc/init.d/confluence start
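I made the heap change by hand in vi. For anyone who would rather script it, here’s a minimal sketch. I’m assuming the heap flags live on a `CATALINA_OPTS` line in setenv.sh, and the starting value of 64m is invented because I don’t remember what they had set it to. The function name `set_heap` is hypothetical too.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the -Xms and -Xmx values in a setenv.sh-style
# file to the given size (e.g. 2g). Uses GNU sed's in-place editing.
set_heap() {
  local file="$1" size="$2"
  sed -i -E "s/-Xms[0-9]+[mMgG]/-Xms${size}/; s/-Xmx[0-9]+[mMgG]/-Xmx${size}/" "$file"
}

# Demo on a temp file with a made-up starting value.
heap_demo=$(mktemp)
echo 'CATALINA_OPTS="-Xms64m -Xmx64m ${CATALINA_OPTS}"' > "$heap_demo"

set_heap "$heap_demo" 2g
cat "$heap_demo"
# prints: CATALINA_OPTS="-Xms2g -Xmx2g ${CATALINA_OPTS}"
```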
Problem #4: Netcat process tying up port 8080.
I started Confluence again and checked the logs. Of course, there’s another error. Something is tying up port 8080! It’s the standard error you get when Java can’t open a port. I ran netstat to check which process it was and found that it was a netcat process. Those tricky bastards. This is absolutely something I would do if I wanted to mess with someone. They are literally just toying with me.
sudo /etc/init.d/confluence start
vi /opt/atlassian/confluence/logs/catalina.out
netstat -anlp | less
psg 3603
ps -ef | grep 3603
kill 3603
ps -ef | grep 3603
ps -ef | grep confluence
sudo /etc/init.d/confluence stop
ps -ef | grep confluence
cd /opt/atlassian/confluence/logs/
ll
ps -ef | grep tomcat
mv catalina.out catalina.out.bak
sudo mv catalina.out catalina.out.bak
sudo /etc/init.d/confluence start
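Paging through `netstat -anlp | less` works, but it’s slow when the clock is running. Here’s a sketch of a filter that goes straight to the listener on a given port. The function name `listener_on_port` is made up, and so is the sample text: the PID 3603 is the real one from my history, but the program name `nc` and the sshd row are my guesses at what the output looked like.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `netstat -anlp`-style text on stdin and print the
# PID/program column for whatever is LISTENing on the given TCP port.
listener_on_port() {
  local port="$1"
  # $4 is the local address, $6 the state, $7 the PID/program name.
  awk -v p=":$port" '$6 == "LISTEN" && $4 ~ (p "$") { print $7 }'
}

# Illustrative sample only, not the real output from the lab VM.
sample='tcp        0      0 0.0.0.0:22      0.0.0.0:*       LISTEN      1021/sshd
tcp        0      0 0.0.0.0:8080    0.0.0.0:*       LISTEN      3603/nc'

printf '%s\n' "$sample" | listener_on_port 8080
# prints: 3603/nc
```

On a live system the call would be `netstat -anlp 2>/dev/null | listener_on_port 8080`.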
Problem #5: Hibernate C3P0 min and max size blank in confluence.cfg.xml.
I started Confluence again and checked the logs. The app appeared to start correctly, but when I used curl to check port 8080, it returned a 404.
I tried a few different things, but eventually I searched around and found the confluence log:
/var/atlassian/application-data/confluence/logs/atlassian-confluence.log
In that log, I saw this error:
[sf.hibernate.connection.C3P0ConnectionProvider] configure could not instantiate C3P0 connection pool java.lang.NumberFormatException: For input string ""
I have never administered Confluence before, so I had no idea what this meant. Obviously, it was some sort of misconfiguration in the app settings. Probably a blank value where it expects a number?
After searching around for the app’s configuration file, I found it at:
/var/atlassian/application-data/confluence/confluence.cfg.xml
I did some Googling on C3P0 and found what I thought might be the setting that it was complaining about: hibernate.c3p0.min_size and hibernate.c3p0.max_size were both set to blank.
tail -99f catalina.out
curl http://localhost:8080
curl http://localhost:8090
curl http://localhost:8080/
clear
ll
vi catalina.out
ll -rt
vi synchrony-proxy-watchdog.log
ps -ef | grep tomcat
netstat -anlp | grep 501
netstat -anlp | grep 576
netstat -anlp | less
iptables -L
curl http://localhost:80
curl http://localhost:8080
ll
cd /var/atlassian/application-data/confluence/logs
ll
vi atlassian-confluence.log
ll /etc/systemd
ll /etc/systemd/system
cd ..
ll
cd ..
ll
cd ..
ll
cd /opt/atlassian/confluence
ll
cd conf
ll
vi server.xml
vi context.xml
ll
cd ..
ll
ll /var/atlassian/application-data/confluence/logs
ll /var/atlassian/application-data/confluence/logs/atlassian-confluence.log
vi /var/atlassian/application-data/confluence/logs/atlassian-confluence.log
ll
find . -name "*.xml"
find /var/atlassian/ -name "*.xml"
vi /var/atlassian/application-data/confluence/confluence.cfg.xml
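Since the error pointed at an empty string where a number was expected, a quick way to hunt for the culprit is to list every property in the config with an empty value. This is a sketch under assumptions: the `<property name="..."></property>` shape matches what I was grepping for at the time, but the sample file and the `hibernate.c3p0.timeout` entry are made up for the demo, and `blank_properties` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of <property> elements whose value is
# empty, assuming each property sits on one line as <property name="x"></property>.
blank_properties() {
  grep -oE '<property name="[^"]+"></property>' "$1" |
    sed -E 's/.*name="([^"]+)".*/\1/'
}

# Made-up excerpt standing in for confluence.cfg.xml.
cfg_demo=$(mktemp)
cat > "$cfg_demo" <<'EOF'
    <property name="hibernate.c3p0.max_size"></property>
    <property name="hibernate.c3p0.min_size"></property>
    <property name="hibernate.c3p0.timeout">30</property>
EOF

blank_properties "$cfg_demo"
# prints:
# hibernate.c3p0.max_size
# hibernate.c3p0.min_size
```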
I tried to set min and max values, only to find that the lab user doesn’t have permission to edit confluence.cfg.xml!
Problem #6: lab user doesn’t have access to edit /var/atlassian/application-data/confluence/confluence.cfg.xml
So, now I know the answer to my problem, but I don’t have access to apply the solution.
I wasted the most time on this problem. That 60 minute timer will get you every time. If you go down the wrong path for any length of time, you’re burning time you’ll need for something else later. Anyone can think of the wrong thing at first, especially when under pressure.
I was sure I should be able to edit this file. Maybe if I just entered the right sudo command? But the sudoers file was set up so that starting Confluence didn’t require a password, while editing the config file did. I didn’t know the lab user’s password, because I had logged in with an SSH key. I spent a lot of time googling and trying to make it let me edit the file.
I thought maybe there might be some way to reset the lab user’s password without needing its old password if I dug deeply enough. I even checked if the system was susceptible to ShellShock (it wasn’t).
Finally, I realized I have access to scripts that the Confluence user runs when it starts up. Then, I felt pretty dumb. That basically gives me access to run whatever I want.
I added these lines to the end of the setenv.sh script:
sed -i 's/hibernate.c3p0.min_size"></hibernate.c3p0.min_size">5</' /var/atlassian/application-data/confluence/confluence.cfg.xml
sed -i 's/hibernate.c3p0.max_size"></hibernate.c3p0.max_size">20</' /var/atlassian/application-data/confluence/confluence.cfg.xml
After restarting Confluence, the file had been successfully updated and the error didn’t pop back up.
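One nice property of this trick is that the sed lines are safe to leave in a startup script: once the values are filled in, the pattern no longer matches, so every later start is a no-op. Here’s a small sketch that shows this on a temp copy. The two-line sample file is made up to mirror the blank properties, and `patch_pool_sizes` is a hypothetical wrapper around the same two substitutions.

```shell
#!/usr/bin/env bash
# Made-up excerpt standing in for confluence.cfg.xml.
pool_cfg=$(mktemp)
cat > "$pool_cfg" <<'EOF'
    <property name="hibernate.c3p0.min_size"></property>
    <property name="hibernate.c3p0.max_size"></property>
EOF

# Hypothetical wrapper: fill in the blank pool sizes (GNU sed, in place).
patch_pool_sizes() {
  sed -i \
    -e 's/hibernate.c3p0.min_size"></hibernate.c3p0.min_size">5</' \
    -e 's/hibernate.c3p0.max_size"></hibernate.c3p0.max_size">20</' \
    "$1"
}

patch_pool_sizes "$pool_cfg"
patch_pool_sizes "$pool_cfg"   # second run changes nothing: '"><' no longer appears
cat "$pool_cfg"
```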
Keep in mind, the below is my command history. It’s not a tutorial, but a history of my panic. So, you can see all my mistakes and even me trying to perfect the sed regex.
cd /var/atlassian/application-data/confluence/
ll
vi confluence.cfg.xml
vi /var/atlassian/application-data/confluence/logs/atlassian-confluence.log
vi confluence.cfg.xml
ll
sudo -l
sudoedit confluence.cfg.xml
sudoedit /var/atlassian/application-data/confluence/confluence.cfg.xml
sudoedit /var/atlassian/application-data/confluence/confluence.cfg.xml
sudo vi /var/atlassian/application-data/confluence/confluence.cfg.xml
/usr/bin/sudoedit
/usr/bin/sudoedit /var/atlassian/application-data/confluence/confluence.cfg.xml
ll /etc/sudoers
vi /etc/sudoers
sudo -l
sudo vi /var/atlassian/application-data/confluence/confluence.cfg.xml
sudoedit /var/atlassian/application-data/confluence/confluence.cfg.xml
/usr/bin/sudoedit /var/atlassian/application-data/confluence/confluence.cfg.xml
id
passwd lab
passwd
ll /etc/pam.d/common-auth
ll /etc/pam.d
ll /etc/pam.d/authconfig
vi /etc/pam.d/authconfig
man sudo
sudo -i
ll /etc/sshd/sshd_config
ll /etc/ | grep release
cat /etc/system-release
ll /etc/passwd
cat /etc/passwd
ll /home/lab
ll -la /home/lab
vi ~/.bashrc
vi ~/.bash_profile
history
/usr/bin/sudoedit /var/atlassian/application-data/confluence/confluence.cfg.xml
cat /etc/system-release
ll /var/atlassian/application-data/confluence/confluence.cfg.xml
sudo -l
env x='() { :;}; echo Oh No\!' bash_shellshock -c "echo Testing\!"
env x='() { :;}; echo Oh No\!' bash -c "echo Testing\!"
man passwd
passwd -d
sudo passwd -d
sudo -l
ll
cp confluence.cfg.xml /tmp
vi /tmp/confluence.cfg.xml
cp /tmp/confluence.cfg.xml .
vi /tmp/confluence.cfg.xml
rm /tmp/confluence.cfg.xml
history
vi /var/atlassian/application-data/confluence/confluence.cfg.xml
ll
ll -rt /tmp
ll
cp confluence.cfg.xml /tmp
ll /tmp/confluence.cfg.xml
grep "<property name="hibernate.c3p0.max_size"></property>" /tmp/confluence.cfg.xml
vi /tmp/confluence.cfg.xml
sed '25s/hibernate/test/' /tmp/confluence.cfg.xml
sed '25s/hibernate.c3p0.min_size"\>\</test/' /tmp/confluence.cfg.xml
sed '25s/hibernate.c3p0.min_size"></test/' /tmp/confluence.cfg.xml
sed '25s/hibernate.c3p0.min_size"></test/' /tmp/confluence.cfg.xml | grep hibernate
sed 's/hibernate.c3p0.min_size"\>\</test/' /tmp/confluence.cfg.xml
sed 's/hibernate.c3p0.min_size"\>\</test/' /tmp/confluence.cfg.xml | grep hibernate
sed 's/hibernate.c3p0.min_size"/test/' /tmp/confluence.cfg.xml | grep hibernate
sed 's/hibernate.c3p0.min_size"/test/' /tmp/confluence.cfg.xml | grep hibernate.c3p0
clear
sed 's/hibernate.c3p0.min_size"/test/' /tmp/confluence.cfg.xml | grep hibernate.c3p0
sed 's/hibernate.c3p0.min_size"/test/' /tmp/confluence.cfg.xml | grep hibernate.c3p0|test
sed 's/hibernate.c3p0.min_size"/test/' /tmp/confluence.cfg.xml | grep hibernate.c3p0 || test
sed 's/hibernate.c3p0.min_size"/test/' /tmp/confluence.cfg.xml | grep hibernate.c3p0\|test
sed 's/hibernate.c3p0.min_size"/test/' /tmp/confluence.cfg.xml | grep 'hibernate.c3p0\|test'
sed 's/hibernate.c3p0.min_size"></test/' /tmp/confluence.cfg.xml | grep 'hibernate.c3p0\|test'
sed 's/hibernate.c3p0.min_size"></hibernate.c3p0.min_size">1</' /tmp/confluence.cfg.xml
sed -i 's/hibernate.c3p0.min_size"></hibernate.c3p0.min_size">1</' /tmp/confluence.cfg.xml
vi /tmp/confluence.cfg.xml
sed -i 's/hibernate.c3p0.max_size"></hibernate.c3p0.max_size">20</' /tmp/confluence.cfg.xml
vi /tmp/confluence.cfg.xml
sudo -l
history | grep start
ll
ll
cd ..
ll
cd ..
ll
cd /opt/atlassian/
ll
cd bin
ll
cd..
cd con
cd ..
cd confluence/
ll
cd bin
ll
vi setenv.sh
history
history | grep confluence.cfg.xml
vi setenv.sh
/etc/init.d/confluence stop
sudo /etc/init.d/confluence stop
ps -ef | grep tomcat
sudo /etc/init.d/confluence start
Problem #7: Database connection failed.
After restarting Confluence and checking the logs, I saw that the database connection was now failing.
2020-01-27 16:42:44,498 ERROR [C3P0PooledConnectionPoolManager[identityToken->1hge1g9a71sc7vxxewi5qi|31c9b3a1]-HelperThread-#2] [org.postgresql.Driver] connect Connection error:
org.postgresql.util.PSQLException: Connection to REDACTED.us-east-1.rds.amazonaws.com:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:265)
I checked confluence.cfg.xml for the connection settings, and I didn’t immediately see anything wrong.
I tried telnetting to the database port to see if it was open, but telnet wasn’t installed.
I pinged the DB hostname, and it responded. Perhaps someone forgot to start the database up? I didn’t have access to it to check. I thought about running nmap against the address to check if maybe it was on a different port, but nmap wasn’t installed. That would probably require root anyway. I started to run the PostgreSQL CLI tool to see if it was installed, hoping to troubleshoot from there, when I was logged out of my shell and couldn’t get back in. My 60 minutes were up!
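In hindsight, there’s a way to test a TCP port with no telnet and no nmap: bash can open a connection itself through its `/dev/tcp/host/port` redirection. I didn’t think of it at the time, so this is a sketch of what I could have tried, not what I did. The function name `port_open` is made up, and it assumes bash was built with network redirections and that coreutils `timeout` is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host:port can be opened
# within 3 seconds, using bash's built-in /dev/tcp redirection.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Demo against a local port that is almost certainly closed.
if port_open 127.0.0.1 1; then
  echo "open"
else
  echo "closed or filtered"
fi
```

Against the lab’s database it would have been `port_open <db-hostname> 5432`, and then the same call on a few neighboring ports to see whether Postgres was listening somewhere else.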
I didn’t have more than a few minutes to troubleshoot problem #7 because I had spent so much time trying to get sudo to let me edit that stupid config file. I was super engaged in this. I really, really wanted to try it again. I was mad that I didn’t finish, and I couldn’t stop thinking about it for the rest of the week.
The only consolation I have is the assurance from the company that I was very close to the end of the test.
At any rate, this was a really neat exercise. I imagine that they’d have this automated, so they probably spin it up for a new candidate and tear it down once it’s done. Certainly, the timer was automated, because I was disconnected after exactly 60 minutes.
Pretty neat!