Although I'm not actually the CTO of Devver, I had the pleasure of attending the Boulder CTO lunch this past Monday since Dan was out of town.
This week, the group had Todd Vernon from Lijit come lead the discussion. Although Todd is currently CEO of Lijit, he was CTO at his former company, Raindance.
The group that was assembled was small but awesome - I had the opportunity to learn not only from Todd, but also from the CTOs of a few of last year's TechStars companies.
The discussion touched on a ton of topics, but two (related) themes that were heavily discussed were the role of the CTO and how a company grows from a technology perspective. I've organized my notes below. Keep in mind that these are the collected thoughts from a number of different participants and I may not have captured their ideas with 100% accuracy.
The Role of a CTO
What is the difference between a CTO and a VP of Engineering?
The CTO is about leadership on technical issues: interfacing with the business side, guiding the product, and getting people excited about the product from a technical point of view.
The VPoE is almost one step above Chief Architect: more on the management side, focused on getting the product delivered.
First-time CTOs need to figure out their exact role. It's a very amorphous role that varies from company to company.
The CTO needs to be able to tell the whole business story and understand both the good parts and the bad.
Early on, the CTO should insert themselves into the sales process as much as possible (especially right after you hire the sales person). You need to be able to hear what customers say they want, so you can translate that into what they really need.
The Technology Onion
There is a technology onion - make sure the core of the onion is owned by the company (the outer layers are not as important). The CTO needs to figure out the relationship between the technologies and the company's partnerships. The closer a technology is to the core of the onion, the more important it is to own it and to make sure it scales.
For instance, if you're depending on Google for search, you're powerless to change features if things don't work as your customers want/expect. If search is core to your business (near the core of the onion), consider building it internally. It's the CTO's role to make that case, because business people will never understand the need to spend money to get "the same thing."
Having to re-architect a core component of a company can really hurt growth. Assume you're going to be successful, so plan for that.
Along the same lines, one concern about using EC2 is that you get tied to the platform and your business is dependent on an outside force you can't control. Hosting on EC2 can be quite different than hosting your own boxes.
Acceptable Failures
What is acceptable downtime? It depends on when - between midnight and 1-2 AM, it might be OK to be down for a few minutes. CTOs need to determine what acceptable downtime is, communicate it to the rest of management, and get people to agree.
CTOs need to make decisions (for instance, what is the acceptable downtime, acceptable data loss, or acceptable time for page load) and then tell the entire organization. That way, when something bad happens, you can explain that everyone agreed on the specific numbers. It's unlikely your business will need to be (or can be) 100% perfect on all metrics, but people need to understand what the goal is and why it's realistic.
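To make the idea of agreed numbers concrete, here is a minimal sketch of my own (not something shown at the lunch). The target values and metric names are hypothetical placeholders; the point is only that once the numbers are written down, checking against them is trivial.

```python
# Hypothetical targets that management has agreed on. The names and
# numbers are illustrative assumptions, not figures from the discussion.
AGREED_TARGETS = {
    "downtime_minutes_per_month": 45.0,
    "data_loss_minutes": 15.0,
    "page_load_seconds": 3.0,
}


def find_violations(measured, targets=AGREED_TARGETS):
    """Return the sorted names of metrics whose measured value exceeds its target."""
    return sorted(
        name
        for name, limit in targets.items()
        if measured.get(name, 0.0) > limit
    )


def allowed_downtime_minutes(availability, days=30):
    """Convert an availability goal (e.g. 0.999) into minutes of downtime per period."""
    return (1.0 - availability) * days * 24 * 60
```

For example, a 99.9% availability goal works out to roughly 43 minutes of downtime in a 30-day month, which is a much easier number to get people to agree on than "three nines."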
Growing/Scaling/Monitoring
If at least one person is using your service, you should have two web servers. It gives you a ton of flexibility. Having two boxes forces you to work out most of the issues early (it's a lot different getting to 2 boxes than 3 or 4). It's not about load, it's about reliability.
No matter how useful you are, if you are not reliable, someone will blog, "it's cool, but it doesn't work reliably."
Downtime spreads very fast across Twitter. Consider tweeting about upcoming service interruptions ahead of time so customers are aware.
After more than 15 people, you need a dedicated operations person. Get some basic monitoring services early - once a server is under load, problems are really hard to diagnose. Try to detect problems early, when they're easier to debug.
With startups, generally the problem tends to be slow requests rather than complete service downtime. Make sure your monitoring service will alert you about slow requests.
Get app-specific alerts - a warning like "High CPU load" is harder to understand (it might be a problem, or maybe the machine is just handling a lot of requests successfully), but "Page X takes 80 sec to load" is more obvious.
As you grow, try to measure more and more. Things often degrade slowly, and one day you just notice it's too slow and it's hard to go back and find the root of the problem.
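As a rough illustration of an app-specific alert, here is a small sketch of my own. The 2-second threshold and the function names are assumptions; a real setup would more likely hook into whatever monitoring service you already use.

```python
import time

# Assumed threshold; pick whatever page-load time your team agreed on.
SLOW_THRESHOLD_SECONDS = 2.0


def slow_request_alert(page, elapsed_seconds, threshold=SLOW_THRESHOLD_SECONDS):
    """Return a human-readable alert if the page was slow, otherwise None."""
    if elapsed_seconds > threshold:
        return (
            f"Page {page} took {elapsed_seconds:.1f} sec to load "
            f"(limit {threshold:.1f} sec)"
        )
    return None


def timed_call(page, handler, alerts, threshold=SLOW_THRESHOLD_SECONDS,
               clock=time.monotonic):
    """Run handler(), time it, and append an alert to `alerts` if it was slow."""
    start = clock()
    result = handler()
    alert = slow_request_alert(page, clock() - start, threshold)
    if alert is not None:
        alerts.append(alert)
    return result
```

The alert names the page and the time it took, so whoever gets woken up knows where to start looking.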
Make two lists: a) the most catastrophic things that could happen and b) the most likely things that could happen. Where those lists overlap, you need to fix something. But there will be some risks that you decide are reasonable risks for the business (revisit these risks regularly as things change).
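The two-list exercise is simple enough to do on paper, but as a sketch (with made-up example risks in the usage note), the overlap is just an intersection:

```python
def risks_to_fix(catastrophic, likely):
    """Return risks that are on both lists, in the order of the catastrophic list."""
    likely_set = set(likely)
    return [risk for risk in catastrophic if risk in likely_set]
```

For instance, if "database disk fills up" shows up on both your catastrophic list and your likely list, it comes back from `risks_to_fix` and should be at the top of your to-do list. Anything on only one list is a candidate for the "reasonable risk" pile that you revisit later.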
Regarding backup - always make sure you try to restore some data (before you really need it). You need to make sure it works and make sure it's fast enough.
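Here is a minimal sketch of what a restore drill could look like, assuming you recorded a checksum when the backup was taken. The `restore` callable and the time limit are hypothetical stand-ins for whatever your real backup tooling does.

```python
import hashlib
import shutil
import tempfile
import time
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_drill(backup, expected_sha256, restore, max_seconds):
    """Restore `backup` into a scratch directory and report whether it worked.

    `restore` is any callable taking (backup_path, destination_path); it is
    a placeholder for your real restore procedure.
    """
    with tempfile.TemporaryDirectory() as scratch:
        destination = Path(scratch) / "restored"
        start = time.monotonic()
        restore(backup, destination)
        elapsed = time.monotonic() - start
        matches = sha256_of(destination) == expected_sha256
    return {
        "matches": matches,
        "fast_enough": elapsed <= max_seconds,
        "seconds": elapsed,
    }
```

With a plain file copy as the restore step, you would call something like `restore_drill(backup_path, recorded_checksum, shutil.copyfile, max_seconds=60)` and check both `matches` and `fast_enough` in the result.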
You should always be able to describe at a high level how the service will scale infinitely (it doesn't have to be technically perfect, but it has to be believable). When someone wants to purchase, that'll be a huge help - the business guy on the other side of the table will want to buy, but the technical guy doesn't want to buy (he wants to build it in-house).
I hope those notes make some sense and give you a good feel for the discussion we had. I'm looking forward to attending more of these lunches (I hope they'll continue to let a few CEOs sneak in...)