A Comparison of Agile Software Development and Tabletop Roleplaying Games

 

I have a problem with Dungeons and Dragons 5th edition. It is supposed to be a game of vivid imagination. That’s what I like about it. Its many rules hold it back, though. People sit down at the table and look at their character sheet. They see that it is complicated. It has a dizzying array of abilities and skills. Someone plops down a thick book in front of them. They are intimidated.

The game master asks the player what they want their character to do.

The player looks down at their character sheet, scanning the numerous abilities and skills. Right away, their imagination is gone. They are now thinking about the rules. The scope of things they believe their character can do is limited. They will only consider the options that are written down for them.

This is a huge problem if you want to encourage story-driven games and player creativity. If such things are your priority, you have to switch to a rules-light RPG. These systems have barely any rules at all. They define just enough to get you going. The idea is, if you run into a situation where there isn’t a rule for something, you just figure it out. You have a whole team of smart people. You don’t need a rule for everything. This encourages imaginative gameplay. It allows people to get into a space where they can think creatively.

Agile Software Development is the same way. It, too, has a number of different implementations. Some of them are very heavy on the rules and procedures. The authors attempt to legislate every possible contingency.

Here’s my problem: If you complain about a rule you don’t like, advocates will throw up their hands and say, “Oh, you don’t have to use that! It’s just about the story/productivity! You can pick and choose what works for you! Do what fits your team!”

This, to me, feels super disingenuous. If you publish a huge book of rules, you know that some rules lawyer is going to memorize them all, obsess over every detail, and say things like, “Actually, on page 37, it says that we should be doing blah blah blah. This is rules as written in the book!”

Nothing matters but what’s written in the book. People are like this. That attention to detail is what makes us good at software.

This is about the ivory-tower idealized view of how a complex system of rules and procedures is meant to be used versus how such systems are actually used on the ground in the real world. Some people will always hyper-focus on the rules as written. They won’t care about the spirit of the rules.

I would argue that the reason people keep “getting Agile wrong” is that they are mesmerized by the dizzying number of procedures and rules. Just look at this SAFe chart as an example.

I don’t think modifying a large framework to your own taste is a good idea. I believe that a lighter system would work better. It makes people feel more welcome to make their own changes. My reason for this argument is human nature. People need to consider what humans do when presented with tons of rules. They follow them blindly.


Addendum: I would be remiss if I didn’t link my favorite article on SAFe. This is what I had in mind when I wrote this article. Although, it could certainly be argued that SAFe is not Agile at all.

In particular, this quote is interesting:

A key part of SAFe that I have not yet touched on is the aggregation of existing concepts like Scrum, Kanban, Lean Product, Lean UX, and DevOPs.

If you’re unfamiliar, I’d suggest exploring each concept independently over time rather than all at once. Many are valuable themselves, but SAFe doesn’t do a great job at actually synthesizing them and can sometimes add confusion. 

This reminds me of D&D games where every sourcebook is allowed and players end up creating broken characters. The analogy isn’t perfect, but the general idea is that the more disparate systems you jam together, the more complex a thing gets and the less willing its participants are to question it or to think creatively at all. 

A Historical Look at Collaboration in Tech

When I was hired on to Walmart as a programmer in 2004, we had these brown phones. As you can see from the photo below, there was no caller ID. In a global company, when that phone rings, you don’t know what fresh hell has come knocking on your door. It could be from someone in an area you’d never heard of with a complex problem in an app your team owns that you didn’t even know you were responsible for. You learn to figure things out fast.

I grew up in the 80’s with BBS’s, and in the 90’s with IRC. I always wanted persistent chat at work. But in the 2000’s, computer chat was seen as something you do in your free time on AOL. When I would suggest we use it at work, I would be told no, that people would think we were using it to goof off.

One year, I met Joshua Rowell. He had a secret way of getting chat channels at Walmart. At a large company, you can’t just install whatever you want on your PC. It’s tightly controlled with an approval process. There must be a valid business justification. He found out that the program Exceed, which was used for managing X Window sessions, also had a little-known chat feature. We started using it to collaborate with our co-workers.

I remember one day, I watched his manager walk by his cubicle, and Josh had the Exceed chat open. He minimized it as quickly as he could so as not to be caught. The manager asked, “Hey, what was that? It looked like some kind of chat program.” “Oh nothing! It was nothing!”

I still think about that interaction. The idea that you’d have to hide interacting with co-workers seems a little crazy today. But back then, it was seen as the equivalent of playing video games on your work PC.

As the years rolled on and perceptions changed, we eventually got Microsoft Communicator. It wasn’t persistent chat, but it did give us the ability to IM people. For someone like me, who grew up online, it was a game changer. I loved it, and I used the hell out of it. The network of people I knew grew exponentially.

Many, many years later, there was a time when the California people had Slack, and the Bentonville people didn’t. Bentonville people only had IMs. I asked several times if our area could get licenses for Slack. I was told it was too expensive for the very large number of people we had. I looked into it, and they were right. It was obscenely expensive.

I worked on Walmart’s Build Tools team at the time. Our stated mission was to provide tools for developers to improve their lives. Another developer and I, Louis Page, decided, you know what? If they won’t pay for Slack, we’ll just set something up ourselves.

This was right about the time Docker was starting to gain popularity. So, I logged into one of the many VMs we secretly hoarded (instead of decommissioning them like we were supposed to), and ran ‘docker pull rocket.chat’. It was an open source persistent chat solution. Louis and I set it up with a load balancer, added an FQDN into DNS, set up backups, and started spreading it around via IMs.

We were in a unique position in the company. Most of the developers in the company would IM us every day to help fix various build issues. We had reach. Rocket Chat started gaining popularity fast. I watched teams move in, make their own channel, and say to each other, “This is what we will use for collaboration moving forward.” I was actually pretty surprised how fast it took off and how large the userbase grew.

Docker was a great solution for this. Very easy to set up. The high load of Walmart’s IT division made Rocket Chat start crashing all the time. It was a bug with Docker’s virtual file system. Louis figured out you can make Docker restart a container if it crashed. Problem solved. No one ever noticed it restarting every 10 minutes.
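I don’t have the exact command Louis used, so treat this as a minimal sketch of one common way to get that behavior: Docker’s `--restart` policy flag. The container name `chat`, the `unless-stopped` policy, and the `DRY_RUN` switch are assumptions for illustration, and a real Rocket Chat deployment also needs a MongoDB instance, which is left out here.

```shell
#!/usr/bin/env bash
# Start a container that Docker restarts automatically whenever it exits.
# By default this only prints the command; set DRY_RUN=0 to actually run it.
start_with_restart_policy() {
  local name=$1 image=$2
  # --restart unless-stopped: restart after crashes, but not after a manual stop.
  local cmd=(docker run -d --name "$name" --restart unless-stopped "$image")
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# "chat" is a hypothetical container name; rocket.chat is the image from the pull above.
start_with_restart_policy chat rocket.chat
```

A restart policy hides the crash rather than fixing it, which is exactly the trade-off we made.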

It sort of wasn’t cool for us to do this. There was already a Communication Tools team who had this responsibility, and it’s actually super uncool to go behind their backs like this. But we had a great manager who would defend us, and we had just enough plausible deniability from our being a Developer Tools team that we got away with it.

So, for some period of time, I don’t remember how long, Rocket Chat enjoyed great popularity. Eventually, after many meetings between managers, it was agreed that Bentonville IT would get Slack and we would need to decommission Rocket Chat. We won. Now the whole division had Slack.

I like to think it changed the way the division communicated and collaborated. Now, I can’t imagine working somewhere without persistent chat. It would just be archaic. How would I have worked in a remote position for the past two years without these wonderful tools?

Although complaining about the poor state of things can be cathartic, it is more fun to fix things and make them better. And you don’t always have to follow the rules to do so. I would encourage people to not get stuck under the weight of procedures and past expectations. Try new things! You might end up making things better.

SRE Ops Lab

Back in January 2020, I interviewed for a Senior Site Reliability Engineer position at a company in Austin. They had a pretty cool process for testing their candidates’ troubleshooting skills. I had a ton of fun doing it, so I’m sharing it here in the hopes that other companies adopt the idea.

I’ve run into many of these situations in the past as a systems administrator. It’s 3am. People are yelling at me on the phone. I can’t think straight. What in the world did my co-workers do to this box? Why would they do these things?

The basic idea is that they spin up a temporary virtual machine with Confluence installed on it. You are expected to ssh in, start Confluence, navigate to a certain page within Confluence and tell them what’s on it. The only problem is, the instance is totally jacked up. It’s broken in multiple ways that would confound any normal person. They want details on how you approached your problem solving.

I was given an IP address, a login name, and a private SSH key. They said I’d have 60 minutes to fix everything and figure it out. The timer starts on the first ssh connection.

A few relevant facts were given:

  • The database username.
  • The application install directory.
  • The application’s home directory.
  • The Tomcat log directory.
  • The Confluence log directory.
  • The url to the webapp’s status page.

I saved my command history for my writeup to the company. I’ll include some of those below to give an idea of my process. It’s kind of neat seeing someone poking around a system they’re unfamiliar with.


The first step, ssh’ing in:

chmod 600 Downloads/id_rsa
ssh -i Downloads/id_rsa lab@IP-ADDRESS

I took the private SSH key they emailed me, fixed its permissions, and used it to SSH into the VM.

In a way, this alone is a good test. A senior should be able to ssh into a machine with a private key. It’s basic, but there are candidates out there who would be unable to do this.
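The chmod matters because OpenSSH refuses to use a private key that other users can read. Here is a small sketch of that check, assuming GNU coreutils (`stat -c`); the function name `secure_key` and the throwaway demo file are made up for illustration.

```shell
#!/usr/bin/env bash
# Make sure a private key is readable only by its owner, then print its mode.
# Assumes GNU coreutils for `stat -c`.
secure_key() {
  local key=$1 mode
  mode=$(stat -c '%a' "$key") || return 1
  if [ "$mode" != "600" ]; then
    chmod 600 "$key"
  fi
  stat -c '%a' "$key"
}

# Demo on a temporary file standing in for the emailed key.
demo_key=$(mktemp)
chmod 644 "$demo_key"
secure_key "$demo_key"
rm -f "$demo_key"
```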

Once I was logged in, I checked disk space and memory. There was nothing out of the ordinary there.

df -h
free -g
htop
top

I tried starting Confluence by running the startup script (given in my original email). Of course, it fails to start.


Problem #1: “/etc/init.d/confluence” script had incorrect directory reference (/zopt instead of /opt)

Upon opening the script in vi, I see that a directory being used is obviously wrong (/zopt instead of /opt).

/etc/init.d/confluence start
ll /opt/atlassian/confluence/bin
set -o vi
vi /etc/init.d/confluence
/etc/init.d/confluence start

I also realized I’m supposed to be using sudo to start the application. And, just like they said in their email, there were only a limited set of commands sudo was allowed to run.

sudo -l
sudo /etc/init.d/confluence start
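Spotting the bad path by eye in vi worked, but a quick grep gets there faster. This is a sketch only: the script contents below are a hypothetical stand-in, since I didn’t save the real init script, and `find_bad_prefix` is a name I made up.

```shell
#!/usr/bin/env bash
# Print every line (with its line number) that mentions a suspicious path prefix.
find_bad_prefix() {
  local prefix=$1 file=$2
  grep -n -- "$prefix" "$file"
}

# Hypothetical stand-in for /etc/init.d/confluence.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/sh
CONF_DIR=/zopt/atlassian/confluence
exec "$CONF_DIR/bin/start-confluence.sh"
EOF

find_bad_prefix '/zopt' "$script"

# Cross-check the prefix against what actually exists on disk.
[ -d /zopt ] || echo "/zopt does not exist on this machine"
rm -f "$script"
```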

Problem #2: JAVA_HOME referencing incorrect directory (/zopt instead of /opt)

Now, the script is complaining that it can’t find Java. The JAVA_HOME environment variable had been set incorrectly. It was looking in a path that looked reasonable, but in fact was wrong.

It’s been over a year since I completed this. I don’t remember exactly why I couldn’t fix JAVA_HOME by editing /etc/init.d/confluence. But I did have to search around and find a file that I could edit that would actually allow me to affect the environment of the service before it started. That file ended up being “setenv.sh”.

Now, I’m logged in as a normal user here, so I don’t have access to edit every file. Many important files are read-only to me. So I actually had to poke around and search, not only to see where things were, but to see what I could actually touch.

psg tomcat
ps -ef | grep tomcat
vi /opt/atlassian/confluence/logs/catalina.out
vi /opt/atlassian/confluence/bin/catalina.sh
vi /opt/atlassian/confluence/logs/catalina.out
vi /opt/atlassian/confluence/bin/catalina.sh
echo $JAVA_HOME
sudo -l
/opt/atlassian/confluence
ll
cd conf
ll
cd ..
cd bin
ll
vi setenv.sh
vi setjre.sh
history
ll
id
sudo /etc/init.d/confluence start
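A path that looks reasonable but is wrong is easy to miss with `echo $JAVA_HOME` alone. A minimal sketch of a stricter check, assuming the usual layout where the JVM lives at `bin/java` under JAVA_HOME; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed only if the given directory contains an executable bin/java.
valid_java_home() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Report on whatever JAVA_HOME is set to in the current environment.
if valid_java_home "${JAVA_HOME:-}"; then
  echo "JAVA_HOME looks usable: $JAVA_HOME"
else
  echo "JAVA_HOME is unset or has no bin/java: '${JAVA_HOME:-}'"
fi
```

Keep in mind the service gets its environment from its own startup scripts, so the value in your login shell may not be the one the service sees.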

Problem #3: Java heapsize set too low.

I tried starting Confluence again, only to find the error you get when a Java process reaches its max heapsize.

This is like an obstacle course. It’d be trivial if you were just messing around on your own, but a 60 minute timer really messes with your head. Any wasted time will get you in trouble later. If you go down the wrong path, even for 5 minutes, it could cost you.

At any rate, I edited the Java “-Xms” and “-Xmx” arguments in the setenv.sh file. I noticed the VM had 3g of ram, so I set the heapsize to 2g.

ps -ef | grep tomcat
ps -a
top
history
vi /opt/atlassian/confluence/logs/catalina.out
free -g
free -m
vi setenv.sh
vi /opt/atlassian/confluence/logs/catalina.out
vi setenv.sh
sudo /etc/init.d/confluence start
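I made the heap change by hand in vi. The same edit can be sketched with sed, as below. The one-line file is a hypothetical stand-in for setenv.sh, the starting values of 64m are invented, and `set_heap` is a name made up for this example.

```shell
#!/usr/bin/env bash
# Rewrite the -Xms/-Xmx values in a file, leaving every other option untouched.
set_heap() {
  local size=$1 file=$2 tmp
  tmp=$(mktemp) || return 1
  # Write to a temp file first, then copy back so the original permissions survive.
  sed -E "s/-Xms[0-9]+[kKmMgG]?/-Xms${size}/g; s/-Xmx[0-9]+[kKmMgG]?/-Xmx${size}/g" "$file" > "$tmp" &&
    cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Hypothetical stand-in for the relevant line of setenv.sh.
demo=$(mktemp)
echo 'CATALINA_OPTS="-Xms64m -Xmx64m ${CATALINA_OPTS}"' > "$demo"
set_heap 2g "$demo"
cat "$demo"
rm -f "$demo"
```

Setting 2g on a 3g machine leaves some room for the OS and the JVM’s non-heap memory.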

Problem #4: Netcat process tying up port 8080.

I started Confluence again and checked the logs. Of course, there’s another error. Something is tying up port 8080! It’s the standard error you get when Java can’t open a port. I ran netstat to check which process it was and found that it was a netcat process. Those tricky bastards. This is absolutely something I would do if I wanted to mess with someone. They are literally just toying with me.

sudo /etc/init.d/confluence start
vi /opt/atlassian/confluence/logs/catalina.out
netstat -anlp | less
psg 3603
ps -ef | grep 3603
kill 3603
ps -ef | grep 3603
ps -ef | grep confluence
sudo /etc/init.d/confluence stop
ps -ef | grep confluence
cd /opt/atlassian/confluence/logs/
ll
ps -ef | grep tomcat
mv catalina.out catalina.out.bak
sudo mv catalina.out catalina.out.bak
sudo /etc/init.d/confluence start
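Paging through netstat output in less is slow when the clock is running. Here is a sketch of pulling the owning PID out directly, assuming Linux net-tools output where the last column is PID/program. The sample line is illustrative: only the PID 3603 comes from my session, and the program name `nc` is a guess at how netcat showed up.

```shell
#!/usr/bin/env bash
# Read `netstat -anlp`-style output on stdin and print the PID listening on a port.
listener_pid() {
  local port=$1
  awk -v port="$port" '
    $6 == "LISTEN" && $4 ~ (":" port "$") {
      split($7, owner, "/")   # last column looks like PID/program
      print owner[1]
    }'
}

# Illustrative sample line, not captured output.
sample='tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      3603/nc'
printf '%s\n' "$sample" | listener_pid 8080
```

On a live box the input would come from `netstat -anlp 2>/dev/null | listener_pid 8080`. Without root, netstat only shows PIDs for your own processes.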

Problem #5: Hibernate C3P0 min and max size blank in confluence.cfg.xml.

I started Confluence again and checked the logs. The app appeared to start correctly, but when I used curl to check port 8080, it returned a 404.

I tried a few different things, but eventually I searched around and found the confluence log:

/var/atlassian/application-data/confluence/logs/atlassian-confluence.log

In that log, I saw this error:

[sf.hibernate.connection.C3P0ConnectionProvider] configure could not instantiate C3P0 connection pool java.lang.NumberFormatException: For input string ""

I have never administered Confluence before, so I had no idea what this meant. Obviously, it was some sort of misconfiguration in the app settings. Probably a blank value where it expects a number?

After searching around for the app’s configuration file, I found it at:

/var/atlassian/application-data/confluence/confluence.cfg.xml

I did some Googling on C3P0 and found what I thought might be the setting that it was complaining about: hibernate.c3p0.min_size and hibernate.c3p0.max_size were both set to blank.

tail -99f catalina.out
curl http://localhost:8080
curl http://localhost:8090
curl http://localhost:8080/
clear
ll
vi catalina.out
ll -rt
vi synchrony-proxy-watchdog.log
ps -ef | grep tomcat
netstat -anlp | grep 501
netstat -anlp | grep 576
netstat -anlp | less
iptables -L
curl http://localhost:80
curl http://localhost:8080
ll
cd /var/atlassian/application-data/confluence/logs
ll
vi atlassian-confluence.log
ll /etc/systemd
ll /etc/systemd/system
cd ..
ll
cd ..
ll
cd ..
ll
cd /opt/atlassian/confluence
ll
cd conf
ll
vi server.xml
vi context.xml
ll
cd ..
ll
ll /var/atlassian/application-data/confluence/logs
ll /var/atlassian/application-data/confluence/logs/atlassian-confluence.log
vi /var/atlassian/application-data/confluence/logs/atlassian-confluence.log
ll
find . -name "*.xml"
find /var/atlassian/ -name "*.xml"
vi /var/atlassian/application-data/confluence/confluence.cfg.xml
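The NumberFormatException on an empty string suggested a property with nothing between its tags. A minimal sketch of listing such properties, assuming each `<property>` sits on its own line; the XML fragment is a made-up stand-in shaped like confluence.cfg.xml, and `blank_properties` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Print the name of every <property> element whose body is empty.
blank_properties() {
  sed -n 's/.*<property name="\([^"]*\)"><\/property>.*/\1/p' "$1"
}

# Hypothetical fragment; the timeout entry and its value are invented.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
<properties>
  <property name="hibernate.c3p0.max_size"></property>
  <property name="hibernate.c3p0.min_size"></property>
  <property name="hibernate.c3p0.timeout">30</property>
</properties>
EOF
blank_properties "$cfg"
rm -f "$cfg"
```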


I tried to set min and max values, only to find that the lab user doesn’t have permission to edit confluence.cfg.xml!

Problem #6: lab user doesn’t have access to edit /var/atlassian/application-data/confluence/confluence.cfg.xml

So, now I know the answer to my problem, but I don’t have access to apply the solution.

I wasted the most time on this problem. That 60 minute timer will get you every time. If you go down the wrong path for any length of time, you’re burning time you’ll need for something else later. Anyone can think of the wrong thing at first, especially when under pressure.

I was sure I should be able to edit this file. Maybe if I just entered the right sudo command? But the sudoers file was set up such that starting Confluence doesn’t require a password, but editing the config file does require a password. I didn’t know the user’s password, though, because I had logged in via an ssh key. I spent a lot of time googling and trying to make it let me edit the file.

I thought maybe there might be some way to reset the lab user’s password without needing its old password if I dug deeply enough. I even checked if the system was susceptible to ShellShock (it wasn’t).
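For anyone curious what that ShellShock check looks like as a yes/no test, here is a sketch. It wraps the classic probe for CVE-2014-6271 in a function; the function name and the `VULNERABLE` marker string are my own choices.

```shell
#!/usr/bin/env bash
# Succeed if the given bash runs code smuggled in through an environment
# variable (CVE-2014-6271); fail if it treats the variable as plain data.
shellshock_vulnerable() {
  local shell=${1:-bash} out
  out=$(env x='() { :;}; echo VULNERABLE' "$shell" -c 'echo probe' 2>/dev/null)
  case $out in
    *VULNERABLE*) return 0 ;;
    *) return 1 ;;
  esac
}

if shellshock_vulnerable bash; then
  echo "bash is vulnerable to Shellshock"
else
  echo "bash is not vulnerable to Shellshock"
fi
```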

Finally, I realized I have access to scripts that the Confluence user runs when it starts up. Then, I felt pretty dumb. That basically gives me access to run whatever I want.

I added these lines to the end of the setenv.sh script:

sed -i 's/hibernate.c3p0.min_size"></hibernate.c3p0.min_size">5</' /var/atlassian/application-data/confluence/confluence.cfg.xml
sed -i 's/hibernate.c3p0.max_size"></hibernate.c3p0.max_size">20</' /var/atlassian/application-data/confluence/confluence.cfg.xml

After restarting Confluence, the file had been successfully updated and the error didn’t pop back up.
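With more time, I would have rehearsed the substitution on a copy before putting it in a startup script. Below is a sketch of a slightly stricter version that escapes the dots in the property name and prints the result instead of editing in place. The function name is hypothetical, and it assumes the empty element is written as `<property name="..."></property>` on one line.

```shell
#!/usr/bin/env bash
# Print the file with one empty <property> filled in. Dots in the name are
# escaped so they match literally instead of matching any character.
fill_blank_property() {
  local name=$1 value=$2 file=$3
  local pattern=${name//./\\.}
  sed "s|name=\"${pattern}\"></property>|name=\"${name}\">${value}</property>|" "$file"
}

# Rehearse on a throwaway copy; 5 is the min_size value I used above.
cfg=$(mktemp)
echo '<property name="hibernate.c3p0.min_size"></property>' > "$cfg"
fill_blank_property hibernate.c3p0.min_size 5 "$cfg"
rm -f "$cfg"
```

Because the pattern only matches an empty element, running it on every startup is harmless once the value is filled in.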

Keep in mind, the below is my command history. It’s not a tutorial, but a history of my panic. So, you can see all my mistakes and even me trying to perfect the sed regex.

cd /var/atlassian/application-data/confluence/
ll
vi confluence.cfg.xml
vi /var/atlassian/application-data/confluence/logs/atlassian-confluence.log
vi confluence.cfg.xml
ll
sudo -l
sudoedit confluence.cfg.xml
sudoedit /var/atlassian/application-data/confluence/confluence.cfg.xml
sudoedit /var/atlassian/application-data/confluence/confluence.cfg.xml
sudo vi /var/atlassian/application-data/confluence/confluence.cfg.xml
/usr/bin/sudoedit
/usr/bin/sudoedit /var/atlassian/application-data/confluence/confluence.cfg.xml
ll /etc/sudoers
vi /etc/sudoers
sudo -l
sudo vi /var/atlassian/application-data/confluence/confluence.cfg.xml
sudoedit /var/atlassian/application-data/confluence/confluence.cfg.xml
/usr/bin/sudoedit /var/atlassian/application-data/confluence/confluence.cfg.xml
id
passwd lab
passwd
ll /etc/pam.d/common-auth
ll /etc/pam.d
ll /etc/pam.d/authconfig
vi /etc/pam.d/authconfig
man sudo
sudo -i
ll /etc/sshd/sshd_config
ll /etc/ | grep release
cat /etc/system-release
ll /etc/passwd
cat /etc/passwd
ll /home/lab
ll -la /home/lab
vi ~/.bashrc
vi ~/.bash_profile
history
/usr/bin/sudoedit /var/atlassian/application-data/confluence/confluence.cfg.xml
cat /etc/system-release
ll /var/atlassian/application-data/confluence/confluence.cfg.xml
sudo -l
env x='() { :;}; echo Oh No\!' bash_shellshock -c "echo Testing\!"
env x='() { :;}; echo Oh No\!' bash -c "echo Testing\!"
man passwd
passwd -d
sudo passwd -d
sudo -l
ll
cp confluence.cfg.xml /tmp
vi /tmp/confluence.cfg.xml
cp /tmp/confluence.cfg.xml .
vi /tmp/confluence.cfg.xml
rm /tmp/confluence.cfg.xml
history
vi /var/atlassian/application-data/confluence/confluence.cfg.xml
ll
ll -rt /tmp
ll
cp confluence.cfg.xml /tmp
ll /tmp/confluence.cfg.xml
grep "<property name="hibernate.c3p0.max_size"></property>" /tmp/confluence.cfg.xml
vi /tmp/confluence.cfg.xml
sed '25s/hibernate/test/' /tmp/confluence.cfg.xml
sed '25s/hibernate.c3p0.min_size"\>\</test/' /tmp/confluence.cfg.xml
sed '25s/hibernate.c3p0.min_size"></test/' /tmp/confluence.cfg.xml
sed '25s/hibernate.c3p0.min_size"></test/' /tmp/confluence.cfg.xml | grep hibernate
sed 's/hibernate.c3p0.min_size"\>\</test/' /tmp/confluence.cfg.xml
sed 's/hibernate.c3p0.min_size"\>\</test/' /tmp/confluence.cfg.xml  | grep hibernate
sed 's/hibernate.c3p0.min_size"/test/' /tmp/confluence.cfg.xml  | grep hibernate
sed 's/hibernate.c3p0.min_size"/test/' /tmp/confluence.cfg.xml  | grep hibernate.c3p0
clear
sed 's/hibernate.c3p0.min_size"/test/' /tmp/confluence.cfg.xml  | grep hibernate.c3p0
sed 's/hibernate.c3p0.min_size"/test/' /tmp/confluence.cfg.xml  | grep hibernate.c3p0|test
sed 's/hibernate.c3p0.min_size"/test/' /tmp/confluence.cfg.xml  | grep hibernate.c3p0 || test
sed 's/hibernate.c3p0.min_size"/test/' /tmp/confluence.cfg.xml  | grep hibernate.c3p0\|test
sed 's/hibernate.c3p0.min_size"/test/' /tmp/confluence.cfg.xml  | grep 'hibernate.c3p0\|test'
sed 's/hibernate.c3p0.min_size"></test/' /tmp/confluence.cfg.xml  | grep 'hibernate.c3p0\|test'
sed 's/hibernate.c3p0.min_size"></hibernate.c3p0.min_size">1</' /tmp/confluence.cfg.xml
sed -i 's/hibernate.c3p0.min_size"></hibernate.c3p0.min_size">1</' /tmp/confluence.cfg.xml
vi /tmp/confluence.cfg.xml
sed -i 's/hibernate.c3p0.max_size"></hibernate.c3p0.max_size">20</' /tmp/confluence.cfg.xml
vi /tmp/confluence.cfg.xml
sudo -l
history | grep start
ll
ll
cd ..
ll
cd ..
ll
cd /opt/atlassian/
ll
cd bin
ll
cd..
cd con
cd ..
cd confluence/
ll
cd bin
ll
vi setenv.sh
history
history | grep confluence.cfg.xml
vi setenv.sh
/etc/init.d/confluence stop
sudo /etc/init.d/confluence stop
ps -ef | grep tomcat
sudo /etc/init.d/confluence start

Problem #7: Database connection failed.

After restarting Confluence and checking the logs, I saw that the database connection was now failing.

2020-01-27 16:42:44,498 ERROR [C3P0PooledConnectionPoolManager[identityToken->1hge1g9a71sc7vxxewi5qi|31c9b3a1]-HelperThread-#2] [org.postgresql.Driver] connect Connection error:

org.postgresql.util.PSQLException: Connection to REDACTED.us-east-1.rds.amazonaws.com:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:265)

I checked confluence.cfg.xml for the connection settings, and I didn’t immediately see anything wrong.

I tried telnetting to the database port to see if it was open, but telnet wasn’t installed.

I pinged the DB hostname, and it responded. Perhaps someone forgot to start the database up? I didn’t have access to it to check. I thought about running nmap against the address to check if maybe it was on a different port, but nmap wasn’t installed. That would probably require root anyway. I started to run the postgresql CLI tool to see if that was installed, to maybe troubleshoot, when I was logged out of my shell and was unable to get back in. My 60 minutes was up!
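In hindsight, there is a way to test a TCP port with neither telnet nor nmap: bash can open sockets itself through its /dev/tcp redirection. A minimal sketch, assuming bash was built with that feature and that coreutils `timeout` is available; `port_open` and the `DB_HOST` variable are hypothetical names, and the default of 127.0.0.1 is only a placeholder for the real database hostname.

```shell
#!/usr/bin/env bash
# Succeed if a TCP connection to host:port can be opened within the timeout.
# Uses bash's built-in /dev/tcp redirection, so no telnet, nc, or nmap is needed.
port_open() {
  local host=$1 port=$2 secs=${3:-3}
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# DB_HOST is a placeholder; point it at the hostname from confluence.cfg.xml.
host=${DB_HOST:-127.0.0.1}
if port_open "$host" 5432; then
  echo "$host:5432 is accepting connections"
else
  echo "$host:5432 is closed, filtered, or unreachable"
fi
```

A refused connection and a timeout both count as failure here, so this tells you the port is not usable but not why.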

I didn’t have more than a few minutes to troubleshoot problem #7 because I had spent so much time trying to get sudo to let me edit that stupid config file. I was super engaged in this. I really, really wanted to try it again. I was mad that I didn’t finish, and I couldn’t stop thinking about it for the rest of the week.

The only consolation I have is the assurance from the company that I was very close to the end of the test.

At any rate, this was a really neat exercise. I imagine that they’d have this automated, so they probably spin it up for a new candidate and tear it down once it’s done. Certainly, the timer was automated, because I was disconnected after exactly 60 minutes.

Pretty neat!