EP 09 SERIES 1

Panel discussion at PyCon CZ 2023

aired 27 November 2023
runtime 48:08
guests Naďa Jašíková · Karel Minařík
language en


English

Recording of our live panel discussion at PyCon CZ 2023 (in English). We talk with Naďa Jašíková — SRE veteran from OCI — and Karel Minařík, who spent almost 10 years at Elastic. Topics range across SRE practice, ownership, and how big distributed teams actually run things in production.


Show notes

Video recording on YouTube

Transcript


Machine-generated. This text is YouTube's automatic speech recognition on the original EN audio. Expect errors in technical vocabulary, no speaker labels, missing punctuation, and the occasional surreal misrender. A reviewed transcript will replace this in time.

[Music] You build it, you run it.

Let’s start with introductions now, like who’s going to talk about this topic. First, please welcome our guests that we are hosting on the show. First, Naďa, can you introduce yourself? Okay, oh, perfect. Hi, I’m Naďa. I know two of these gentlemen, the ones that actually run the You Build It, You Run It podcast. I’m not sure if they’re going to be talking about it, but please check it out if they won’t. Anyway, my name is Naďa, I am a site reliability engineer, which is a tongue twister, but it’s a position, it’s a role that I found my calling in. It was about eight years ago when I joined these two gentlemen in a startup called Apiary, and I started helping them break things. I have been in this role for, as I said, eight years. The startup got acquired by Oracle, so now I’m in a corporation, in slightly

different teams, but I try to apply those principles. I most definitely build it and run it, so I know the pain points. I have been in multiple different roles before I landed here, starting with technical writer, project manager, developer, so I’ve been there, and I landed here and I’m very happy. So, looking forward to discussing things. Okay, right, Karel? Hello. Yeah, you go ahead. Should I go ahead? Yeah. Yes, so my name is Karel, I’m a developer primarily. I’ve been working with Ruby and Rails for a very long time. Back then I actually had no choice, I had to run it: also build it, also run it. I spent a couple of years writing Chef cookbooks and stuff like that. Then I joined Elastic, the company behind Elasticsearch, where I was doing Ruby and Go as well, and strangely enough I joined the infra team there. So yeah, I have a couple of stories from that part of my life. Maybe you? Yeah, I am Ladislav, I

started as a webmaster many years ago and after that as a developer, and as Naďa mentioned, eight years ago we joined the startup Apiary, and I was leading the SRE team there for almost seven, eight years before I left Oracle. Now I am working at Pure Storage as a developer. It’s very nice to again not be a manager and to work as an IC. And I’m Vilibald, most people call me Vilda, and they kind of introduced me halfway: I worked with those guys for quite a while. I’m still in Oracle, still doing cloud, and currently I’m what they call an architect, which usually translates to some staff-plus engineer. I did logging there, I did various other projects, and before that I was a developer for quite a long time (something like route navigation, back in 2004 or so, and it scared me a lot when people started using it). And I actually started, my first job was being on call, so I started with

the run part and didn’t really do much of the development. Those were cash systems at Carrefour, that’s like a French Tesco, something like that. So now introductions are done, everybody knows who we are dealing with, so let’s start with the first topic, which we selected as a good one to start with: what does 'you build it, you run it' bring for developers? What are the advantages and disadvantages? Is it a good idea? Why even bother with it, what does it bring when you bother with your own infrastructure, logging, and trying to solve all these things? I don’t know who wants to go first, any ideas? So I just have a very quick one. To me, this is the way I have been doing things for quite a long time, and what it ultimately brings me is the ownership of the service. I am responsible for what I wrote, I can interact with the

customer, I can see how they’re using the product, I can improve it so that it is actually useful for them. So to me, just sticking with the 'you build it' part gets rid of a whole lot of other advantages or good things that it can bring. So ownership, and making sure that I’m there all the way from the initial idea to how we’re maintaining, how we’re running things, and how we’re improving, so that it’s useful, so that it actually works. I kind of missed the question, can you very shortly repeat it? Yes: why do it this way, what does it bring to a developer when they also do the rest? Why do it at all, why not just build it, why also run it? Hire someone to run it and do everything else, right? So I think sometimes you have no choice, right? That was my case, you know,

back before Elastic I was working on a social media monitoring system called Social Insider here in the Czech Republic, and we were just two developers. There was no DevOps, no SREs, nothing like that, so I had no choice. I think that what Naďa said resonates with me, because then you can really understand how the service really works, how the different components are interconnected. Some people, when you say you’re an architect, think it’s a four-letter word, although it has more letters. When you also do infrastructure, you can actually touch the architecture itself, so that’s a big thing for me. Láďa, what’s your take on it? I always see ownership as the main thing. I worked in classic companies, for example at LMC, where it was quite different. We were developers working on jobs.cz and similar sites, and we had sysadmins who were deploying things, and we usually threw the code over the wall to them and

only they were responsible for on-call, and I really hated it that way, because I always had that ownership. I always cared about how the deploy went and discussed with the sysadmins where the issues were, and other things. But that was common practice, and I didn’t see a lot of changes. After I left LMC, I joined the startup, which was only six engineers, and we had to do everything. The great thing was Heroku as a PaaS, quite new but very successful in those days, which we used for deploying. You just git push your application and you run it, and it was very easy to pick up as a developer. I had used it for my hobby projects before I joined the startup, so I had plenty of experience with it, and I really liked that we didn’t need to care about the machines themselves. That was my first touch with cloud, eight years ago, and later we got AWS and we had some

services in AWS. We were one of the first users of Lambdas in AWS, for example, and we had big issues with that service because it was at a very early stage and we found many pitfalls there, but it was super good for our use case of parsing API specs. And during all that time there was that ownership, and communicating with our customers via customer support, and all those things. If something was broken, we fixed it in a few days or in a few hours, and it was a really good experience for me. That’s why I like 'you build it, you run it'. Yeah, go ahead. If I may: what Láďa was telling reminded me of the Apiary times, and there’s one more thing that is very important to me, that it brings the whole team together. We were actually talking about it earlier today: our CTO, a person who’s normally supposed to do all this high-level stuff, was on call. He was

doing platform on-call, and all of us hated it when he was on the shift, because he would come up with all these improvements that we had to do. He would cut like 15 tickets after a 12-hour shift. But it really brought us together as a team, so that the titles were not important. What was important was that all of us were working to get the product running, to get the service going. It wasn’t important if you were a junior who joined a few months ago or if you were one of the people who founded the company; all were equal in this, and I really enjoyed that. Because I started the way I started, being basically first on call, I actually got high on being able to affect things super fast. If I saw a problem I could fix it straight away and not, you know, wait, cut the ticket, and then three weeks pass and nothing is

happening, which is very frustrating. I also had the experience the other way around, where I delivered something and then three months later I received a ticket that something’s not working, and I was like, I don’t remember, I’m lost. So from my perspective, the thing that brings the biggest value is the feedback you get and the chance to actually affect things fairly fast. I see a good question; this is Curious Child asking, and I’m picking this one up because I think it’s on topic: 'Wouldn’t this ruin the best-man-for-the-job principle in big projects and cause knowledge gap issues?' Been there, done that, and I think the answer to this is: if you specialize, you become siloed, and then, in my experience, you possibly don’t see the issues anymore. It’s like, oh, I’m working on this, they are doing that, and, you know, I don’t care

about the rest. Some of the things are really subtle, and from the, let’s say, specialist perspective everything’s 100% fine, but then running that thing that somebody developed, and that works perfectly fine on their machine, may be a horrible experience, and there’s no way to convey that. So I think if you’re delivering something, you should feel the pain if it doesn’t work. That’s my take. Yeah, I think that it’s a valid question, but it’s a question of scale. It depends on how big the project is. Evidently, for a super large project, monster projects like Facebook, you probably want to have a dedicated team who maintains the infrastructure. Although one thing comes to my mind that changed this landscape quite a lot, and that’s actually Docker, because Docker changed how you run it and how you build it and kind of meshed those two together, because it’s normal these days to just

run the application as a Docker container, even on your own notebook; build it, run it, it’s right there in the commands themselves. So that’s been a big change over the years, I would say. I was kind of involved in this, yeah, both in positive and negative ways. Yeah, of course, putting together Docker images that have 1 gigabyte and pushing them over the network is quite painful, I remember that from Oracle. Actually, to the point that there are teams running infrastructure: it’s totally correct that large organizations do it. I work in one where similar things are happening; there are different teams and you have to work with them. But usually what happens is that they still establish this feedback loop somehow, so if it really, really sucks and it’s your fault, they’re going to push the pain in your direction. And that’s usually the way you achieve the best, from my perspective. Otherwise, as I said, being on both sides, I’ve seen

like, yeah, I’m feeling the pain, but there’s no way I can tell the devs, hey, it really is wrong, it really is bad. You can complain, but nobody cares because it’s not affecting them. Yeah, and from my experience, for example in that startup with six developers, every six weeks you were on call for a whole week, and that wasn’t just infrastructure: you reviewed all pull requests and deployed them. My experience was that if you are responsible for the deployment and for on-call that night, you are very careful to do the review correctly. If tests are missing, you push back, you retest everything, and you are very careful about what you are putting into production and what quality you are getting. And the pain when you just forgot something, or thought, that’s okay, I trust this person, this will be okay, gives you quite a headache when you wake up at 2 a.m. And if you wake up several times a week at 2 a.m., that’s a very painful

experience, and that feeds back into the quality of your review of the pull requests. Yes, yeah, yes, exactly. I’m looking at the questions and I think they’re sort of aligning with the topics we wanted to discuss anyway. One of them is: 'What to do about developers for whom the run-it part is way out of their comfort zone and could cause them a lot of stress?' There is a similar topic here: what are the skill sets, what are the attitudes, what is the mindset that is usually important in order to succeed in this paradigm? Because usually people are good developers, but then the rest... What do you think? Can I? Okay. This was quite important during my time as an SRE manager, when I was leading and helping many other teams in Oracle start with on-call and helping other managers manage it. When starting on-call in a developer team that has never

been on call before, we always discussed how to do it. I think the key factor is training. And you have to be very careful about who can join on-call and who cannot. If you have quite a big team, you can choose people. There are usually some people who, for example, have problems with sleeping; they cannot be on call, because waking up can be quite a disaster for their health and other things. There are exceptions, but the main thing is some fear and impostor syndrome about on-call, and that can be prevented with good training, real chaos trainings where I usually simulate incidents. And what I think is important in designing on-call in general is a very good escalation. You have to be comfortable with escalating to someone else, and they have to be okay with helping you. This trust between, for example, a junior on the base level and someone who always has some more

experience on some escalation level, who can always help you, is a good safety net for many people. Yeah, may I address it a little bit? Because I think Láďa covered the on-call part of this question. My take on this question would be, and don’t take me wrongly, that there’s something wrong with the architecture or with the deployment, because people shouldn’t be stressed about that. Whatever might be wrong: it’s a wrong architecture, or it’s too complicated, there is not a good process, it’s not automated, it’s not reliable. And then maybe something is wrong with the culture. In many teams people just assume you are a superhero when it comes to Linux, and you know every df and du, and Chef, and whatever is there. I’m not, for instance, so I frequently need to look things up. So that would be my take: if that happens, if somebody’s really stressed about this, have a look

at your processes and have a look at your culture. You want to add something? I wanted to add tooling. I mean, Karel mentioned it, but specifically, you are in control of the tooling that you provide to your on-call engineers. So if they don’t feel safe operating the environment, there’s a chance that you can improve, that you can bring in some tools that will make it easier for them: tooling and documentation. And if they’re still unsure, then, as Láďa said, if you have a bigger team you can basically have people choose. But if you provide them a safe space to operate, so that they feel okay troubleshooting and debugging things in production, and if you provide them good tooling, then I’d say most people would be willing to do this. I agree with all that, but I wanted to add a slightly different point of view. For me, the run-it part starts with

merging, usually, or maybe not merging, but when you hit some deploy button or something and you start deploying. That’s the part where the run-it starts, from my perspective. And as was already said, it should be simple, it should not stress you out. If it’s that stressful, then apparently you have a gap and you have to fix it. My first boss had this, maybe not 100% correct, approach to it, which was: if you drink five beers, you should still be able to deploy with no problem. That was the measure for most things, like operational procedures: you’re drunk and you should still be able to do it. Either it fails and everything still runs fine, or you succeed. Basically that level of skill was what was needed, and, you know, he was testing it sometimes. And that’s how you drop a production database. Yes, exactly. So

you don’t get access to these things, right? You make it safe. That’s one part. The other part, and it’s one of the benefits I see, is that if you learn all the things you need in order to succeed in this environment, you become a better developer, because you better understand what you’re developing for and how it’s going to be used. I think most of the stress comes from not knowing what’s going to happen and how it’s going to work. So if you do it two or three times, and then you shadow someone and see how things are done, most of the time the stress is not there anymore, especially if you have, like Láďa mentioned, escalation paths. I think that’s something that can be trained, basically, and it’s worth training. Yeah, and I think real training should have some real scenarios, for example from some

previous incidents, and training people on those is very useful. Or if you introduce some new technology, it’s quite useful to have another team create a scenario for them where something breaks and they have to fix it using the documentation that the other team wrote. For example, I worked at Productboard and we had two infrastructure teams there; I led one, and the other team prepared HashiCorp Vault for the whole company. They prepared documentation, and we ran a chaos day, a training day, where the other team had to resolve the incident. We learned a great many things in a few hours, fixed the documentation and all those things, and it was super useful before we went to production. It was a nice experience, and it can be done in many different ways. May I emphasize something? Because what Naďa said was really important, the tooling part, and to me, as I read that question, sorry, about that emergency

documentation and how to do that... Yes. Cool, cool, exactly. So I think that tooling is much more important in this sense, actually, because frequently, and it was like this in infra at Elastic as well, you hear something like: you should write it down, you have to write it down, we should write it down. Nobody writes it down. If they do, they don’t like writing it down, and then it goes out of sync with the world anyway. In my experience it just doesn’t make any sense to do that, with one big exception, and that’s the postmortem. Like I mentioned, postmortems are super useful, and that’s a topic for another half an hour. But tooling is important, because if you have good tooling, and by tooling I mean also logs, dashboards and all that, and you can actually understand what went wrong, you can see what’s happening in the infrastructure, then you don’t need emergency documentation, because that’s your answer. Then I might disagree a little bit. I

disagree too. Okay, bring it on, go ahead. So if I am to choose between tooling and nothing at all, then I would put the documentation in the middle. Of course you can have fantastic ideas about automating things, about putting together scripts that will do stuff for me, but writing a few notes on the way, as I’m resolving an incident, can go a long way, and that already constitutes your documentation. To answer that question, though, I think a very important thing to note is that you don’t create emergency documentation when you’re actually trying to solve the emergency. You might want to make a few notes, but try to focus on actually resolving, or at least mitigating, the issue before trying to come up with the whole process and tooling to go with it. But yeah, I think documentation is pretty important. I do agree that it gets out of

date and that it needs to be maintained, but I think that also boils down to the culture of the on-call. If I find something wrong with the runbook, if I find something wrong with the documentation when I need it, then going and fixing it will be the first thing that I do once I have mitigated the problem. I will put it this way: when there’s a problem and you see it, you get paged or something, you’re the one who should fix it, and there’s no information, you fix it and you write down roughly how you fixed it. Then you’re leaving the space better than it was. If you have time and there’s a way to encode that, write it somewhere and, you know, fix it forever, cool. If not, at least you’re leaving a trail for someone else if it happens again. At least there’s something. It might be out of sync, but it gives you, like,

maybe some context, maybe, oh, this is the stuff I could try and see if it works or not, and stuff like that. That’s more like the postmortem, right? That was the exception I meant there. So I think emergency documentation is something like FAQs: you don’t write them up front, ever, because the FAQs written up front are the questions nobody asks. Shouldn’t it be more like runbooks for specific alerts, the emergency documentation? That’s how you build it, yeah. Yes, emergency documentation equals, in my view, a runbook. In operations lingo it’s called a runbook: it’s basically the procedure you follow to fix things. From my perspective, you start with nothing, then you have a really bad document, maybe the third time you have a really good document, and the fourth time it’s like, hey, this happens so often that we should really do something about it, and you basically encode it either in a tool or you fix the problem that the software has. I don’t know,

like this. So I’ll give you a specific example, which I think is really horrible. There are things that I’ve worked with where, when they run out of space, they go and delete some files. I asked them many times: why don’t you check the disk space and then delete the files automatically? Because it’s always TMP files, you delete them and that fixes it. And they have been doing that manually for the last three or four years, and it’s still not even documented. They always ask, what are we supposed to do? Oh, you go and delete this. So that’s the anti-pattern, right. Write it down, but start your dos every day, that was... And if you know Monit, it’s a nice tool that is running on my VPS; it’s restarting wrong processes every day, I think, and cleaning space. Yeah, I think that we have a question about that, just at the top, about the observability part. So yes: 'What do you think about observability? In my

experience, typically SREs say they don’t have time for that. Should a team be dedicated to observability?' I think it’s very wrong if an SRE thinks that observability is not their priority; then something is really wrong with the SREs. If you look at the SRE pyramid, observability is the main thing there. And if you need a special observability team, it’s not about using observability but about running your observability platform internally, for cost or some specific compliance reasons. That should be the observability team, who make the platform. But SREs are about making your production visible, making correct dashboards and alerts, primarily alerts, and making the whole observability solution work for them to debug things, and for people on call to easily find the problem. Well, honestly, I think in most cases what I’ve seen so far is that it’s like, oh, there are devs and there are SREs, and they do that and we do this, and you have the silo, and it’s not really the same people, and there’s very

often a disconnect and not enough feedback loops. And actually, I would say that the observability part can kind of bridge this distinction, it can tear down the wall, because observability data are really interesting even for developers, the people who build it, even if they don’t run it. And, you know, to be even more extreme than Láďa was: without observability you have nothing. It’s crazy if the process is to just SSH into a server and randomly delete some logs, whatever is largest. That’s the crazy part. Yeah, sure, that’s exactly my thing. I remember the old times when I was super happy if I had rsyslog and all the web server logs in one place where I could use grep. That was a piece of art in 2000 or something like that. But current observability platforms such as Datadog or something like that are super good if you instrument your application. Especially now,

with OpenTelemetry, you can easily have control over your observability data, you can easily use multiple observability platforms for different reasons, and you can have very good insights on many levels: metrics, traces, logs, whatever you need. I mean, even from the name, right, reliability engineer: how do you know it’s reliable if you don’t see what’s going on? How can you tell? You assume. Well, but then you’re maybe not a reliability engineer anymore, you’re an assumptions engineer. Yeah, or hope, like all of them are. Yeah, hope is not a strategy. And also, if you don’t measure it, you don’t know what you’re dealing with, and you can’t manage it, right? The usual: you don’t measure it, you don’t manage it, and therefore you basically don’t know what’s going on. I have one thing to add, if I may. I mean, I personally am an SRE and I’m crazy about dashboards, but if people are looking for excuses not to touch them, then again, you might have a

problem in the tooling. These things should be easy. Everyone should feel comfortable updating a dashboard or adding some instrumentation to the code. So if it’s not easy, then make it easier, and you will have people more interested. I will go with the next question. Yeah, I wanted to actually address it. It’s anonymous, so I don’t know who asked, and it’s not a question, it’s a statement, which I think is controversial, at least to me: 'I think it’s good to experience firsthand in a startup etc., but not sustainable nor effective for bigger projects.' And I would like to ask, what is a bigger project? Because I work with people on really large projects and they’re still doing it. Yeah, and the phrase 'you build it, you run it' is from AWS, and I think they are quite big. They are running all their services as 'you build it, you run it', yeah,

so I think that’s quite big. So I wanted to ask what 'bigger' is, how large it should be. I think the difference is that, in my experience, you have, for example, certain subject matter experts for certain areas. If something goes really wrong and the average engineer can’t really solve it at that point, because he’s working mostly on other components, then yes, you pull people in. You basically page someone in, escalate, and ask for help. But most of the time, if you’re doing it right, you don’t get that many pages. By a page I mean something went wrong and somebody needs to attend to it. And you build the tools and automations, you improve the architecture so that it is actually able to handle it, so it’s still sustainable. The thing is that if a bigger project equals more people doing operations, then you’re not doing it the

right way. You want to maintain sublinear scaling in that case. I think something that is much easier in a startup setup, if you have just a smaller team, is that the communication is just easier. You essentially shove those people in a conference room and then they figure things out, and everyone is kind of aware, everyone pretty much understands the high-level concepts. If you have a bigger team, if the org grows larger, then the communication, the handovers, the unpopular documentation, those might actually help keep people interested, keep them feeling the ownership. I think ownership is much harder to foster in a bigger org, and it’s important to put effort into it, to pay attention to that one. Yeah, and what I see as the biggest issue, for example if you want to introduce SRE in your company, and I am advising a few companies about that, is culture change. It’s not about the size

of the company but about how you can implement culture changes in your company. Sometimes it’s not the super big companies; they are quite well structured. The biggest issue is a midsize company that is still not in shape, whose culture is not super strong in some ways, and that’s the biggest problem for these changes. From my experience it’s not really the technical issues; it’s more the culture, a people problem about communication, about changes that, for everyone who is involved, are quite a big change in their personal lives and other things. This should be communicated well by management, and the whole process can be quite painful, so it’s good to allow time for it. We discussed this at our previous panel discussion: for example, at Ataccama here in Prague they are starting to do SRE, and they are giving it almost a year to properly establish it in the company. I think it’s quite important that they involve everyone and that they really ask people what the problem is and how to do it,

that’s a big culture change. It relates to a question which Tomáš asked: 'How to change the mindset from "works on my machine" to "own it end to end"? What would you suggest?' From my perspective, it depends what you’re after, but it’s sort of a definition of done: either it works end to end or it’s not finished. I would again say it’s a question of architecture. If you can’t run the same project on your own machine end to end, with whatever it is, possibly Docker, then there’s something strange about the whole thing, to me at least. Yeah, I am still a big fan of running on my machine, and currently, with Docker and Kubernetes, running on my machine is an almost identical environment to the one used in production. I always look at that. And yeah, sometimes it’s hard, especially when you have to deal with clouds, which have some specifics that are not easy to understand, or to discover the issues that you don’t see on your

machine and see only on the platform, especially RBAC, identities and similar things that are quite hard to get right. But yeah, I think, as you mentioned, the real done is running in some environment, at least a staging that is very similar to prod, or running in production; that should always be the aim for developers. And into that definition of done you can then shove more things, such as observability, documentation, making sure that the customers know. I’m sorry, I just keep on talking about documentation today. Making sure that people can actually use the feature that you brought. So being done should mean many more things than just merging your pull request to the main branch. Yeah, and also measure everything, so that you know whether someone is really using your feature, for example, or whether you forgot some feature toggle and no one sees the feature. This touches on the next question I see, and it’s like: recording of deployments to prod, anybody using it

for evidence and understanding issues later during the retro sessions? I actually think if you need to record your deployment, you’re doing something wrong, because it should be repeatable and you should be doing the same thing every time, unless it’s something really extreme, and then it shouldn’t be one person doing it, right? But you have a log, preferably, so we could consider that, and audit logs for compliance reasons; you have all the information there. But yeah, it depends on your size. Yeah, I mean, an exception might be a kind of strange migration, you know, a database migration, a complete switch of technology, something like that. There I would say it’s super useful to have some kind of recording, textual or visual, but those, as [unclear] said, should be exceptions. It should be normal: like, git push, and then you wait, and it happens or doesn’t happen. Yeah, we are running quite short of time, so maybe we can take the next question that I really like. We’re taking questions all

the time, yeah. Okay, so we go with establishing on-call and how you deal with compensation. Yes, of course, money. Yeah, money is important. Money solves it, so what’s the problem? Yeah, you can see different approaches, and usually the right approach is that you just follow the law, the local law. That’s quite an issue. At Oracle, as we mentioned, when we started: you get on call, and if you are in the US you don’t get any money. You just get on call. After 20 years as a Java developer, your manager just tells you that from next week you are on call, with no compensation. That’s quite bad. There are some companies in the US that give you some compensation, but it’s not mandatory; they don’t have any regulations about that, and it’s quite painful. For example, I have teams across the US, Great Britain and Prague, and in the whole European Union at this time we have regulations; we have

quite nice laws, I think, that define what the compensation for your employees has to be, and it works very well. And I think you can make it better if you want. For example, I really like how Seznam.cz is doing it here: they give you extra vacation, which I think is much better than money for many developers, because you have more free time. But you need to have capacity in the team when many people are on vacation. But the money is always good too. I mean, you can model it like emergency doctors have it; there are a lot of examples where it kind of works, right? For some shifts you’re compensated for being ready, and then you’re compensated if something really happens. And the form of compensation could be that you get a day off if it’s horrible, for example. It really depends on the culture in the team, what

people want and so on, but without compensation it really sucks. Yeah, I think it’s good if you have a choice; that’s always good. Under our law there are two possible ways to do it: you can have a flat rate per week, per day or per phone call, or you can have exactly calculated hours, with different rates for standby and for overtime, and some extra money for weekends and Sundays. That’s what Oracle follows, for example. So the other thing is, you could be on call and nothing happens. That’s fine, right? Nobody’s bothered. But if you’re being woken up several times during the night, it’s a big problem, and even money can’t buy that back. That’s why it’s good to have a choice. I actually think that this is something that should be taken into account from the management and business perspective: you can’t really exhaust people this

way, and if you’re doing it this way, so that it really sucks so much, you probably need to fix it. May I address something? There was a follow-up question which touches on this, about waking up at 3:00 a.m. and working through the weekend. The assumption I hear from everybody is that when you’re on call, it’s 2:00 a.m. and you get woken up and you have to get up and get to your computer. This should happen once in three years or something like that. I mean, to me that’s a definition of correctly working processes. It absolutely shouldn’t be regular that you wake up at 2:00 a.m. or whatever. And that’s again, you know, tooling, observability, some kind of auto-healing infrastructure, in whatever sense. It shouldn’t be normal that people are waking up at night because, you know, disk space ran out. That’s crazy, don’t do that. You shouldn’t need to be woken up. But still, from our experience, as [unclear] can mention, there are many teams; you have 200 teams, and there

are quite big differences in how their on-call looks. Yeah, exactly as you mentioned. If you were on call at Apiary in the later years, you would be woken up once per year, and some people never; there are plenty of people who don’t have any incident the whole year. And then we have a different issue: people are not familiar with the tooling and all those things, because they don’t have real incidents. So we have to think about running chaos days or something, so that they get familiar with it at least once per quarter, because the real production is in super good shape. But [unclear] can mention some experience. I think if you see that people are being woken up at 2 or 3 a.m., like, 11:00 a.m. is probably fine. No, it’s not fine; people shouldn’t be woken up at night, that should be the exception. An hour before lunch is probably better. But yeah, what I wanted to say is that’s the incentive for you to fix it, right? Exactly, that’s the

point. The whole point is, if you are being woken up at 3:00 a.m., you’re not doing your job right, or somebody else is not doing it right, and this needs to be fixed and improved. That’s the idea of the feedback there. Yes, and there is quite a big issue around your manager, who has to set the correct behavior and priorities for your team. And if that’s not happening, maybe change jobs, honestly. Sometimes the business is so broken that the only way out for the engineer is to go somewhere else, right? I’ve seen that. I’ve done that too. I’m just saying. Sorry, I remember there’s this team that I joined about two years ago, and it felt like they had this boiled-frog attitude. There was this operational review meeting that I joined, and the on-call engineer was saying: oh, in this past week’s shift I only had 40 pages, and only half of them were

in the middle of the night. And I was looking at it, and he said: oh, and this was a good shift, it was fine. So sometimes people just get used to horrible things, because they just keep on happening and nobody pays any attention to it. So, as was said, sometimes the way is to try to change this, and be very vocal that this is wrong, that it shouldn’t be like this, and if it doesn’t work, then leave. So thank you very much. We’ve run out of time.

[Music]


You Build It, You Run It!