Friday, May 17, 2024

Programming is for the juniors

It's the time of the year when I learn the most, because I have responsibility over a summer trainee. With four years at Vaisala, that makes four summer trainees, and my commitment to following through with the summer of success means I will coach and guide even though I will soon be employed by another company. With a bit of guidance, we bring out the remarkable in people, and there will be no seniors if we don't grow them from juniors. Because, let's face it, they don't come ready to do the work from *any* of the schools and training institutions. Work, and growing in the tasks and responsibilities you can take up, is where you learn. And then teaching others is what solidifies what you learned. 

This summer, I took special care in setting up a position that was a true junior one. While someone with more experience might get the work done faster, the junior pay allows them to struggle, to learn, to grow, and still contribute. The work we do is one where success is most likely defined by *not leaving code behind for others to maintain* in any relevant scale, but by reusing the good things the open source community has to offer us in the space of open source compliance, licenses and vulnerabilities. I have dubbed this the "summer of compliance", perhaps because I'm the person who giggles their way through a lovely talk on malicious compliance, considering something akin to that a real solution when corporate approaches to compliance make us forget why these things matter in the first place. 

If the likely success is defined by leaving no code, why did I insist on searching for a programmer for this position? Why did I not allow people to do things the manual way, but instead set the expectation from the start that we write code? 

In the world of concepts and abstractions, code is concrete. Code grounds our conversation. If you as a developer disagree with your code, your code in execution wins that argument. Even when we love saying things like "it's not supposed to do THAT". Watch the behavior and you're not discussing a theory but something you have. 

A screenshot of a post by Maaret Pyhäjärvi (@maaretp@mas.to), posted on May 16, 2024: "You are counting OS components, show me the code" => "Your code is counting OSS components but it's a one line change".

This is why I don't let trainees do *manual* analysis but force them to write scripts for the analysis. Code is documentation. It's a baseline. It is concrete. And it helps people of very different knowledge levels to communicate.

Sometimes the mistakes revealed when reviewing the detailed steps of the work are like this: a subtle difference between OS (operating system) and OSS (open source software) for someone who is entirely new to this space. The effort put into writing the wrong code was a small step to fix, whereas the same mistake made in a manual analysis would have wasted far more effort. 
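To make that concrete, here is a minimal sketch of the kind of analysis script I have in mind, with a made-up SBOM layout (real scanners have their own schemas); the whole OS versus OSS distinction lives in one filter line, which is exactly why the mistake was a small step to fix:

```python
# Minimal sketch: count components from an SBOM-style JSON file.
# The SBOM layout here is hypothetical; real tools use their own schemas.
import json
import sys
from collections import Counter

def count_components(sbom_path: str) -> Counter:
    with open(sbom_path) as f:
        sbom = json.load(f)
    counts = Counter()
    for component in sbom.get("components", []):
        # The one-line distinction: operating system packages vs. all open source components.
        if component.get("type") == "operating-system-package":
            counts["OS"] += 1
        else:
            counts["OSS"] += 1
    return counts

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, dict(count_components(path)))
```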

Watching the conversation unfold, I recognize we have already moved through three bumps on our road of the programming way. 

The first one was the argument of "I could learn this first by doing it manually". You can learn it by doing it with programming too, and I still remember how easy it is to let that bump ahead become the excuse that keeps you on the path of *one more manually done adjustment*, never reaching scale. With our little analysis work, the first step of analyzing components on one image felt like something we could have done manually. But getting to 20 images in 5 minutes of work, that we got from pushing ourselves through the automate bump. 


We also ran into two other bumps on our road. The next one, with scale, is the realization that what works for 5 may not work for 6, 11, 12, 17, and 20. The thing you built that works fine for one, even for a second, will teach you surprising things at scale. Not because of performance, but because the data you rely on being generated is not always the same. For the exceptions bump, you extend the logic to deal with a real world outside what you immediately imagined. 
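A sketch of what extending the logic looks like in practice, with hypothetical field names: the first image taught us licenses come as plain strings, and later images taught us otherwise.

```python
# Sketch of the exceptions bump: the same analysis, hardened for data that is not
# always shaped the way the first image taught us. Field names are hypothetical.
def license_of(component: dict) -> str:
    # Image 1 had a plain "license" string; later images had lists, nested
    # objects, or nothing at all. Extend the logic instead of fixing data by hand.
    value = component.get("license") or component.get("licenses")
    if value is None:
        return "UNKNOWN"
    if isinstance(value, str):
        return value
    if isinstance(value, list) and value:
        first = value[0]
        return first.get("id", "UNKNOWN") if isinstance(first, dict) else str(first)
    return "UNKNOWN"
```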

Testers love the exceptions bump so much that they often get stuck on it, and miss more relevant aspects of quality. 

The third bump on our road is the quality bump, and oh my, the conversation on that one is not an easy one. So you run a scanner, and it detects X. You run a second scanner, and it does not detect X. Which one is correct? It could be correct as is (different tools, different expectations), but it could also be a false positive (it wasn't there for anyone to find anyway) or a false negative (it should be found). What if there is a Y to detect that neither finds, how would you know that then? And in the end, how far do you need to go in the quality of your component identification to say that you did enough to no longer be considered legally liable? 

The realizations that popular scanners differ by significant percentages in what they find, and that if legal were an attack vector against your company, your tools' quality would matter for it, that is definitely sinking into the deep end with the quality bump. 
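To make the quality bump concrete, a tiny sketch of the comparison, with made-up component names standing in for real scanner output:

```python
# Sketch of the quality bump: comparing what two scanners report for the same image.
# The component names are just whatever each scanner emits; neither list is "the truth".
scanner_a = {"openssl", "zlib", "busybox", "libxml2"}
scanner_b = {"openssl", "zlib", "curl"}

only_a = scanner_a - scanner_b   # false positives in A, or false negatives in B?
only_b = scanner_b - scanner_a   # the same question in the other direction
both = scanner_a & scanner_b

print(f"agreed on {len(both)}, only scanner A: {sorted(only_a)}, only scanner B: {sorted(only_b)}")
# And the Y that neither scanner reports never shows up in this comparison at all.
```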

I am sure our summer trainee did not expect how big a part of the work asking *why* and *what then* would be. But I'm sure the experience differs from a young person's first job in the service industry. And it makes me think of those times of my own nostalgically, appreciating that there's an intersection where our paths cross through time. 

Back to the click-bait-y title: programming really is for the juniors. It is for us all, and we would do well not to let the bumpy road stop us from ever going that way. Your first hello world is programming, and there's a lifetime of learning still available. 


Wednesday, May 15, 2024

Done with Making These Releases Routine

I blogged earlier on the experience of task analysis of what we do with releases, and shared that work for others to reflect on. https://visible-quality.blogspot.com/2024/02/making-releases-routine.html

Since then, we have moved from the 2024.3 release to the 2024.7 release, and achieved routine. Which is lovely in the sense that I am changing jobs, which then tests whether the routine holds without me present. 

In the earlier analysis, I did two things. 

  1. Categorized the actions
  2. Proposed actions of low value to drop

Since then, I have learned that some of the things I would have dropped can't be dropped, as they are so built in. Others that I would keep around have so little routine and traction from anyone other than me that, since they are compliance oriented, I could drop them too without impacting the end result. 

Taking a snapshot of the changes I have experimented with: for the 2024.3 release I chose to focus on building up two areas of competence. I coached the team's tester heavily on continuous system testing, and learned that many of the responsibilities we allocate to continuous system testing are actually patching product ownership in terms of creating a compliance track record. 


The image of the same categories, but with the actual work done for 2024.3 versus 2024.7, shows the change. No more cleaning up other people's mess in Jira. No more manual compliance to insufficiently managed "requirements" to show a systematic approach that only exists with significant extra patching. The automation run is the testing done, and anything else, while invisible, is welcome without a record of it. Use time on doing testing over creating compliance records. 

While the same structure shows how much work one can throw out of a traditional release process, restructuring what remains does a better job describing the change. The thing we keep is the idea that master is ready for release any time, and developers test it. Nothing manual must go in between. This has been the key change that testers have struggled with on continuous releases. There is no space for their testing, and there is all the space for their testing. 


In the end, creating a release candidate and seeing it get deployed could be fully automatic. Anything release compliance related can be done in a continuous fashion, chosen to be dropped or at least heavily lightened, and essentially driven by better backlog management. 

With this release, after 30 minutes I called my work done. From 4 months of "releasing" to 30 minutes. 

If only the release would reach further than our demo environment, I would not feel as much like I wasted a few years of my life as I am moving to a new job. But some things are still outside my scope of power and influence. 


Monday, May 13, 2024

Using Vocabulary to Make Things Special

Whenever I talk to new testers about why they think the ISTQB Foundation course was worth their time, one theme above all is raised: it teaches the lingo. Lingo, noun, informal - often humorous: a foreign language or local dialect; the vocabulary or jargon of a particular subject or group of people. Lingo is a rite of passage. If your words are wrong, you don't belong. 

And my oh my, do we love our lingo in testing. We identify the subcultures based on the lingo, and on our attitude towards the lingo. We correct each other in more and less subtle ways. To a level that is sometimes ridiculous.

How dare you call a test a developer wrote a unit test if it is a unit-in-integration test, not a unit-in-isolation test? Don't you know that contract testing is essentially different from API testing? Even worse, how can you call something test automation when it does not do testing but checking? Some days it feels like these are the only conversations we have going on. 

At the same time, folks such as me come up with new words to explain that things are different. My personal favorites among the words I am inflicting on the world are contemporary exploratory testing, to bring exploratory testing back to its roots while modernizing it with current understanding of how we build software, and ensemble testing, because I just could not call it mob testing and be stuck with the forever loop of "do you know mobbing means attacking". Yes, I know. And I know that grooming is a word I would prefer not to use for backlog refinement, and that I don't want to talk about a black-white continuum as something where one is considered good and the other bad, and thus I talk about allowlisting and blocklisting, and I run Jenkins nodes, not slaves. 

Lingo, however, is a power move. And one of those exploratory testing power moves with lingo is the word sessions. I called it a "fancy word", only to realize that people who did some work in the exploratory testing space back in the day clearly read what I write enough to show up and correct me. 

Really, let's think about the word sessions. What things other than testing do we do in sessions? What are the words we more commonly use, if sessions aren't it? 

We talk about work days. We talk about tasks. We talk about tasks that you do until you're done, and tasks you do until you run out of time. We talk about time-boxing. We talk about budgeting time. We talk about focus. But we don't end up talking about sessions. 

Speaking of conversations about words, I watched a conversation between Emily Bache and Dave Farley on software engineering and software craftsmanship. It was a good conversation, and it created nice common ground in wanting to identify more with the engineering disciplines. It was good to watch because while on a shallow level it was about the right term to use, it was really talking about cultures and beliefs. The words used around those words were valuable to me, inspirational, and thought provoking. 

We use words to compare and contrast, to belong, to communicate and to hopefully really hear what the others have to say. And we can do that centering curiosity. 

Whatever words you use, use them. More words tend to help. 

Monday, April 22, 2024

Learning the hard way, experience

Over my career, I have delivered my fair share of good results in testing. There are two signature moves I wanted to write a post about, and what I learned while failing at both of them after I first thought I had them pocketed. 

These two signature moves are changes I have been driving through successfully over multiple organizations and teams: 

  1. Frequent releases
  2. Resultful contemporary exploratory testing with good automation at heart of it

Frequent releases

Turning up the release frequency is an improvement initiative I have now done for over 10 different teams and products, across four organizations. And of the products/teams/orgs I have done this at, only one of them was the optimal case of a web application with control over its distribution channels. I've been through this the hard way: removing reboots from upgrade cycles, integrating processes that were designed for non-continuous delivery such as localization and legal, including telemetry to know if we leave customers hanging on old versions. All of that is an article of its own.

Succeeding with this is not the interesting part, except for the latest success of my current team making releases routine after dropping the ball and suffering through a 4-month stabilization period last year. I dare to call it a success again since we made our 5th release of this year last week, and included the best scope and process we have had in it. 

Why did we fail with a stabilization phase then? We can surely learn something from it. 

The experiences point us to failing at communicating and collaborating with a non-technical product owner. How can that create a four-month stabilization phase? By leaking uncertainty, and pushing uncertainty to a time when it piles up. We leaked uncertainty in functional scope, but also in parafunctional scope, and when pushed to address performance and reliability, we found functional scope we had not recognized existed. 

Breaking main and not realizing how much we had broken it (automation was still passing!) cost us extra stress and uncertainty. We would look at tasks as "I did what you asked", not as "we together did what you did not know to ask". 

And when we tested, we failed at communicating our results effectively. New people, new troubles, and they accumulated quickly when we did not want to stop the line to fix before moving forward. We optimized for keeping people busy, over getting things done to a shape where they could be released.  

Having my team fail with my signature move taught me that there is much more of implicit knowledge on continuous timely testing than what we capture with automation. Designing for feedback, and making it timely takes learning, and this year we have been learning that with a short leash of releases.

Resultful contemporary exploratory testing with good automation at heart of it

Framing test automation as worthwhile notetaking mechanism while exploring has become so much of a signature move of mine, that I have given it a label of its own: contemporary exploratory testing. When we explore, some of it is attended and some unattended. Test automation, when failing, calls for us to attend. Usually pretty much continuously. 

Working with developers, we have learned to capture our notes as lovely English sentences describing the product capabilities in some tests; as developer intent captured in unit tests in others; as component tests, as subsystem tests, and as system-of-systems tests when those make sense. We have 2500 tests we work with on the new tech stack side, some hand-crafted, others model-based generated tests we use as reliability tests, allowing longer unique flows to be generated and tracked.

We learn about information we have been missing, and we extend our tests as we fix the problems. We extend the tests with new capabilities. How can all this successful routine be a failure?

It turns out there was a difficult conversation we did not have on time, one of validation. And when the problem is that you are building your product so that it does not address the user needs, none of the other tests you may have built matter. Instead of being a help, they become friction, where you will explain how changing them all will be work, at a time when more work is less welcome.  

This failure comes from two primary sources: carefulness in cross-organizational communication, due to project setup reasons, delaying the learning, and, as much as that, the unavailability of more senior test/dev folks in a previous project delivery. 


Both failures are successes if we learned and can apply what we learned going forward. For me, they taught that a career change is in order: I don't want to find myself spread so thin that I don't recognize these expensive mistakes happening. So I will be a tester again, fully. Well, surrounded by testers, as a director of test consulting, with a hands-on consulting role starting in June. Looking forward to new mistakes, and to correcting the mistakes I make. Because: mistakes were made, and by me. Only taking ownership allows us to move forward and learn. 

* recommendation of a book: 


Sunday, March 17, 2024

Urban Legends, Fact Checking and Speaking in Conferences

I consume a lot of material in the form of conference talks, and I know exactly the moment when conference talks changed for me forever. It was the Scan Agile conference in Helsinki many years ago, and I had just listened to a talk from an American speaker. I enjoyed their experience as told from the stage so much that I shared what I had learned with my family. Only to learn the story was fabricated. 

In one go, I became suspicious of all stories told from the stage. I started recognizing that my stories are lies too; they are me-sided retellings of actual events. While I don't actively go and fabricate them, those who do will tell me that this is how human experience and memory work anyway. And the responsibility of taking everything with a grain of salt is on the consumer. 

The stage seeks memorable, and impactful. And if that includes creating urban legends, it bothers some popular speakers less than it bothers me. 

With that mindset, I still enjoy conference talks, but they cause me the extra work of fact-checking. I may go and have a conversation to link a personal story to a reference. But more often, it's the other people's stories they share that I end up searching for online. 

In the last two weeks of conferences, I have been fact-checking two stories told from stage. They were told by my fellow testing professionals. And evidence points to word of mouth urban legends on AI. 

The first story was about some people sticking tiny traffic signs around to fool machine-vision-based models. The technique is called an adversarial attack. There are articles on how small changes with stickers mess up recognition models. There are ideas that this could be done. But with my searches, I did not find conclusive evidence that someone actually did this in production, in live environments, risking other people's health. I found the story dangerous as it came without warnings about not testing this in production, or about the liability of doing something like this. 

In addition to searching online for this, I asked a colleague at work with experience of scanning millions of kilometers of road imagery where this could be happening. It just so happens I have worked very close to a product with machine vision, and could assess whether this was something those in the problem space knew of. I was unable to confirm the story. The evidence points to misdirection from stickers on buses, but not to tiny stickers a computer sees but a human doesn't. 


The story was told to about 50 people. Those 50 people, trusting the presenter on the stage, would get to apply their fact-checking skills before they do what I do now: tell the story forward as it was told, with their own flavor on it. 

The second story was told by two separate presenters. The story was about someone getting a binding contract on company letterhead for buying a car for $1, where the justice system is still out deciding whether the contract must be upheld. One presenter showed it as a reimagined chat conversation translated to Finnish and made no claims about letterhead or legal proceedings, but the more colorful version was already out there, presented to this audience. 

Fact-checking suggests that someone, specifically Chris Bakke, asked a ChatGPT-based car sales bot to give an offer for a 2024 Chevy Tahoe with "no takesies backsies", which is far from an official offer on company letterhead. 

So no binding contract. No company letterhead. No car bought for this price. No litigation pending. 

The third story told was a personal one. You can find it as a practical example of analyzing a screenshot for bugs. It suggested ChatGPT4 does better at recognizing bugs from an image than people do. While that may not be entirely incorrect in the scope of that individual picture, the danger of this story is that the example used in the picture is a testing classic. Myers points out that 7.8 out of 14 is what a typical developer scored in 1979, and there are more detailed listings in the literature that show we do even worse at analyzing it, and that the results are technology-dependent. 


Someone else at the conference also suggested we should not read books but instead ask ChatGPT for summaries, completely missing a frame of reference for how well the model would then do compared to a human who has read many of these references. Reading less would not help us. 

Starting an urban legend, especially now that people are keen on hearing stories of good and bad in the realm of AI, is easy. Tell a story, and they tell it forward. Educational and entertainment value over facts. 

So let's finish with a recognition of what I just did with this blog post. I told you four stories. I provided no names of the people who left an impact significant enough on me to take the time to write this. It's up to you to assess, filter, and fact-check, and to choose which stories become part of what you tell forward. Most information we have available is folklore. There are no absolutes. But there is harm in telling stories from the stage, and I would think speakers need to start stepping up to that responsibility. 

Then again, it's theater. Entertainment with a sprinkle of education. 

Wednesday, March 6, 2024

A sample of attended testing

Today, prompted by day 6 of 30 days of AI testing, I tried a tool: Testar. My reasons for giving it a go are many: 

  • Tanja Vos as project leader for the research that generated this would get my attention
  • I set up a research project at a previous employer on AI in / for testing, and some generation of this tool was one of that project's outcomes
  • The day 6 challenge said I should
  • Open source over commercial for hobbyist learning attention all the way

I read the code, read the website, and tried the tool. The tool did not crash but survived over an hour of "testing" our software to the standards of testing I learned some large software contractors still expect from testers, so I could say it did well. I would not go as far as "beat manual testing" like the videos on the site did. 



The tool was not the point for me though. While the tool ran, clicking through 50 scenarios with 50 actions and what seems to be a lot more than 50 clicks, I intertwined unattended testing (the tool doing what the tool does) and attended testing: I was watching a user interface the tool did not watch, for fascinating patterns. 

In my test now I have two systems that depend on each other, and two ways of integrating the two systems for a sample value of pressure. And I have a view that allows me to look at the two ways side by side. 

As Testar was doing its best to eventually conclude "Test verdict for this sequence: No problem detected", I was watching the impact of running a tool such as this on the integration with the second system. 

I noted some patterns: 
  • The values on the left were blinking between no data available and expected values, whereas the values on the right were not. I would expect blinking. 
  • The values on the left were changed with delay compared to the values on the right. I would expect no difference in time of availability. 
  • The bling sounds of trying something with a warning were connected to the patterns I was observing, and directed me to make visual comparisons sampled across the hour of Testar running. 

This is yet another example of why there is no *manual testing* and *automated testing*. With contemporary exploratory testing, this is a sample of attended testing with relevant results, while simultaneously doing unattended testing with no problem detected. 

This was not the first time we used generated test cases, and the types of programmatic oracles general enough to note a crash, a hang, or an error, we have had those around for quite a while, ever since we realized it's about programmatic tests, not automated testing. Programmatic tests provide some of their greatest value in attended modes of running them. 
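A minimal sketch of what I mean by such a general programmatic oracle, with a placeholder command and placeholder error markers rather than anything from our actual product:

```python
# Sketch of a general-purpose programmatic oracle: run the application under test
# and only complain about crashes, hangs, or error markers in the output.
import subprocess

def crash_hang_error_oracle(command: list[str], timeout_s: int = 60) -> list[str]:
    findings = []
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return ["hang: no exit within timeout"]
    if result.returncode != 0:
        findings.append(f"crash: exit code {result.returncode}")
    for marker in ("ERROR", "Traceback", "exception"):
        if marker in result.stdout or marker in result.stderr:
            findings.append(f"error marker in output: {marker}")
    return findings

# Anything this oracle does not catch is exactly what attended testing is for.
```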

For me as a contemporary exploratory tester, automation gives four things: 

Running Testar, I have now 50 scenarios of clicking with 50 actions with screenshots on my disk. That alone serves as documenting in ways I could theoretically benefit from if anyone asked me how we tested things. It would most definitely replace the testing that I cut from 30 days investment to 2 days investment last year, and replace one tester. 

Testar gave me quick clicking on another system while my real interest was on the other one. Like many forms of automation, time and numbers are how we extend reach. 

Testar did not tell me to look at things, but our application's sound alerting did. Looking at the screenshots, I saw a state I would not expect, and went back to investigate it manually. That too was helpful, sampling what I care to attend to. 

The last piece is the guiding to detail I usually get when I don't rely on auto-clickers, but actually have to understand the application to write a programmatic test. 



Tuesday, March 5, 2024

A Bunnycode Case Study for AI in Testing

It's day 5 of 30 days of AI testing, and they ask for reading a case study or sharing your experience. I already shared an experience on an earlier day, and in the whim of a moment, set up a teaching example. 

I google for obfuscated code in Python and find https://pyobfusc.com/. I'm drawn to the most reproducible entry, authored by mindiell, and when I see the code, I'm sold. How would you test this? 

Pretty little rabbit, right? It reminds me of reading some code at work; work is just less intentional with the obfuscation. And I really do not have the time or energy to read that. I could test it as a black box, learning that given a number as a parameter, it gives me a number:
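Something like this minimal probe, assuming I saved the entry as rabbit.py (my own name for the file) and that it takes the number as a command-line argument:

```python
# Sketch of the black-box probe: call the obfuscated script with a number and
# see what comes back, without reading the rabbit at all.
import subprocess
import sys

for n in ["1", "2", "5", "10"]:
    result = subprocess.run([sys.executable, "rabbit.py", n], capture_output=True, text=True)
    print(n, "->", result.stdout.strip())
```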


As if I didn’t know what the rabbit implements or recognize the pattern in the output, I was thinking of just asking ChatGPT about it. However, I did not get that far. 

Instead, I wrote def function(): in my IDE while GitHub Copilot was on, thinking I would wrap the program into a function. It reformatted it into something a bit more readable. 

Prompting some more in the context of code. 

Comment line “#This function” proposes “is obfuscated”. Duh. 

Comment line “#This function imp” proposes “lements the Fibonacci sequence using Binet’s formula”.

At this point, I ask ChatGPT how to test a function that implements the Fibonacci sequence using Binet’s formula. I get a long text saying to try the values I already tried, but in code, and a hint to consider edge cases and performance. I try a few ways of asking for a value that would make a good performance benchmark, and lose patience. 

I google for performance benchmark to learn that this Binet’s formula is much faster than the recursive algorithm, and find performance benchmarks comparing the two. 
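For what it's worth, a minimal sketch of that kind of comparison, using a textbook Binet implementation rather than the rabbit's exact code:

```python
# A quick comparison of Binet's closed-form formula against a plain iterative
# Fibonacci, both for agreement on small inputs and for rough timing.
import math
import timeit

def fib_binet(n: int) -> int:
    """Fibonacci via Binet's formula; exact only while floats have enough precision."""
    sqrt5 = math.sqrt(5)
    phi = (1 + sqrt5) / 2
    return round(phi ** n / sqrt5)

def fib_iter(n: int) -> int:
    """Plain iterative Fibonacci as the reference oracle."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Agreement check: Binet's formula drifts once floating point runs out of precision.
for n in range(90):
    if fib_binet(n) != fib_iter(n):
        print(f"first disagreement at n={n}")
        break

# Rough timing comparison for a value where both still agree.
print(timeit.timeit(lambda: fib_binet(70), number=10_000))
print(timeit.timeit(lambda: fib_iter(70), number=10_000))
```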

I think of finalizing my work today by inserting the bunny code into ChatGPT and asking “what algorithm does this use”, to get a second language model to generate the likely answer of Binet’s formula. Given the risk and importance of this testing at this time, I conclude it’s time to close my case. 

There are so many uses for figuring out what it is I am testing (while being aware of what I can share with tool vendors when giving them access to code), and this serves as a simulation of the idea that you could ask about a pull request. This was the case I wanted to add to the world today.

I should write a real case study. After all, that was one of the accommodations we agreed on with multiple levels of management above me when my team at work started GitHub Copilot tryouts some 6 months ago. I should publish this in action, in real time. As soon as something generates me some time.