-3

How to Scale Agile Software Development with Technology and Lean (www.infoq.com)

submitted 10 hours ago by [email protected] to c/[email protected]

0 comments fedilink

Highlights

The first scaling crisis happened in 1996, when Linus wrote that he was "buried alive in emails". It was addressed by adopting a more modular architecture, with the introduction of loadable kernel modules, and the creation of the maintainers role, who support the contributors in ensuring that they implement the high standards of quality needed to merge their contributions.

The second scaling crisis lasted from 1998 to 2002, and was finally addressed by the adoption of BitKeeper, later replaced by Git. This distributed the job of merging contributions across the network of maintainers and contributors.

In both cases, technology was used to reduce the amount of dependencies between teams, help contributors keep a high level of autonomy, and make it easy to merge all those contributions back into the main repository, Bernhard said.

Technology can help reduce the need to communicate between teams whenever they have a dependency on another team to get their work done. Typical organizational dependencies, such as when a team relies on another team’s data, can be replaced by self-service APIs using the right technologies and architecture, Bernhard mentioned. This can be extended to more complicated dependencies, such as infrastructure provisioning, as AWS pioneered when they invented EC2, offering self-service APIs to spin up virtual servers, he added.

Another type of dependency is dealing with the challenge of merging contributions made to a similar document, whether it’s an illustration, a text, or source code, Bernhard mentioned. This has been transformed in the last 15 years by real-time collaboration software such as Google Docs and distributed versioning systems such as Git, he said.

Anyone trying to scale an agile organization should study lean thinking to benefit from decades of experience on how to lead large organizations while staying true to the spirit of the agile manifesto, he concluded.

Outage left 120 institutions without internet: risks were forecast, but there was no plan B in c/[email protected]

[–] [email protected] 2 points 6 days ago (1 children)

It would be interesting to read what actually happened there and how it was handled, but a Cloudflare-level post-mortem analysis is probably too much to expect.

10

India Is Building an Open-Source Cloud Computing Effort (spectrum.ieee.org)

submitted 2 weeks ago by [email protected] to c/[email protected]

1 comments fedilink

Will be interesting to see how it works out

The Indian nonprofit People+ai wants to fix this by creating an open and interoperable marketplace of cloud providers of all sizes. The Open Cloud Compute (OCC) project plans to use open protocols and standards to allow cloud providers of all sizes to offer their services on the network. It also plans to make it easy for customers to shift between offerings depending on their needs. People+ai held a hackathon on 20 September at People’s Education Society University (PES University) in Bengaluru to test out an early prototype of the platform.

3

Improving platform resilience at Cloudflare through automation (blog.cloudflare.com)

submitted 2 weeks ago by [email protected] to c/[email protected]

0 comments fedilink

Highlights

Failure is an expected state in production systems, and no predictable failure of either software or hardware components should result in a negative experience for users. The exact failure mode may vary, but certain remediation steps must be taken after detection. A common example is when an error occurs on a server, rendering it unfit for production workloads, and requiring action to recover.

It can be tempting to rely on the expertise of world-class engineers to remediate these faults, but this would be manual, repetitive, unlikely to produce enduring value, and would not scale.

The commonality of lower-priority failures makes it obvious when the response required, as defined in runbooks, is “toilsome”. To reduce this toil, we had previously implemented a plethora of solutions to automate runbook actions such as manually-invoked shell scripts, cron jobs, and ad-hoc software services. These had grown organically over time and provided solutions on a case-by-case basis, which led to duplication of work, tight coupling, and lack of context awareness across the solutions.

A good solution would not limit auto-remediation to the SRE team; it would empower the entire company. The key to adding self-healing capability was a generic interface for all teams to self-service and quickly remediate failures at various levels: machine, service, network, or dependencies.

Temporal is a durable execution platform which is useful to gracefully manage infrastructure failures such as network outages and transient failures in external service endpoints. This capability meant we only needed to build a way to schedule “workflow” tasks and have Temporal provide reliability guarantees.
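To make the "transient failures" part concrete, here is a minimal plain-Python sketch of retrying a task with exponential backoff. It is not Temporal's API and says nothing about how Cloudflare built their system; `TransientError`, `run_with_retries`, and the default values are hypothetical names chosen for illustration. A durable execution platform also persists workflow state across process crashes, which this sketch does not attempt.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a network blip)."""


def run_with_retries(
    task: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying on TransientError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable only so the function can be exercised without real waiting; any other exception type is deliberately not retried and propagates immediately.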

After a workflow is validated in the staging environment, we can then do a full release to production. It seems obvious, but catching simple configuration errors before releasing has saved us many hours in development/change-related-task time.

Building a system that is maintained by several SRE teams has allowed us to iterate faster, and rapidly tackle long-standing problems. We have set ambitious goals regarding toil elimination and are on course to achieve them, which will allow us to scale faster by eliminating the human bottleneck.

6

Garbage Collection for Systems Programmers (bitbashing.io)

submitted 3 weeks ago by [email protected] to c/[email protected]

0 comments fedilink

Resurfaced in my feed. Obvious in retrospect.

1

An unexpected discovery: Automated reasoning often makes systems more efficient and easier to maintain | Amazon Web Services (aws.amazon.com)

submitted 3 weeks ago by [email protected] to c/[email protected]

0 comments fedilink

7

Amazon.com, Amazon App, AWS Outage Reported Across U.S. on Friday | Frequent Business Traveler (www.frequentbusinesstraveler.com)

submitted 1 month ago by [email protected] to c/[email protected]

0 comments fedilink

We, humanz, are very good at creating SPOFs

Nonfiction readers. Do you feel guilty reading fiction? in c/[email protected]

[–] [email protected] 1 points 1 month ago

Not anymore. Nowadays I feel guilty reading non-fiction, and I understand the Lindy effect on books much better (be it fiction or non-fiction).

Super hero movies should have more scenes of them accidentally maiming people just because of the sheer amount of power they wield. in c/[email protected]

[–] [email protected] 0 points 1 month ago

They cut all such scenes and pasted them into The Boys, Mark Twain style: “Sprinkle these around as you see fit!”.

17

Most Life on Earth Is Dormant, After Pulling an ‘Emergency Brake’ | Quanta Magazine (www.quantamagazine.org)

submitted 1 month ago by [email protected] to c/[email protected]

1 comments fedilink

Many microbes and cells are in deep sleep, waiting for the right moment to activate.

Harsh conditions like lack of food or cold weather can appear out of nowhere. In these dire straits, rather than keel over and die, many organisms have mastered the art of dormancy. They slow down their activity and metabolism. Then, w

Sitting around in a dormant state is actually the norm for the majority of life on Earth: By some estimates, 60% of all microbial cells are hibernating at any given time. Even in organisms whose entire bodies do not go dormant, like most mammals, some cellular populations within them rest and wait for the best time to activate.

“Life is mainly about being asleep.”

Because dormancy can be triggered by a variety of conditions, including starvation and drought, the scientists pursue this research with a practical goal in mind: “We can probably use this knowledge in order to engineer organisms that can tolerate warmer climates,” Melnikov said, “and therefore withstand climate change.”

Balon is notably absent from Escherichia coli and Staphylococcus aureus, the two most commonly studied bacteria and the most widely used models for cellular dormancy. By focusing on just a few lab organisms, scientists had missed a widespread hibernation tactic, Helena-Bueno said. “I tried to look into an under-studied corner of nature and happened to find something.”

“Most microbes are starving,” said Ashley Shade, a microbiologist at the University of Lyon who was not involved in the new study. “They’re existing in a state of want. They’re not doubling. They’re not living their best life.”

“This is not something that’s unique to bacteria or archaea,” Lennon said. “Every organism in the tree of life has a way of achieving this strategy. They can pause their metabolism.”

“Before the invention of hibernation, the only way to live was to keep growing without interruptions,” Melnikov said. “Putting life on pause is a luxury.”

It’s also a type of population-level insurance. Some cells pursue dormancy by detecting environmental changes and responding accordingly. However, many bacteria use a stochastic strategy. “In randomly fluctuating environments, if you don’t go into dormancy sometimes, there’s a chance that the whole population will go extinct” through random encounters with disaster, Lennon said. In even the healthiest, happiest, fastest-growing cultures of E. coli, between 5% and 10% of the cells will nevertheless be dormant. They are the designated survivors who will live should something happen to their more active, vulnerable cousins.

More fundamentally, Melnikov and Helena-Bueno hope that the discovery of Balon and its ubiquity will help people reframe what is important in life. We all frequently go dormant, and many of us quite enjoy it. “We spend one-third of our life asleep, but we don’t talk about it at all,” Melnikov said. Instead of complaining about what we’re missing when we’re asleep, maybe we can experience it as a process that connects us to all life on Earth, including microbes sleeping deep in the Arctic permafrost.

4

Finnish Startup Wants to Build 100x Faster CPUs (spectrum.ieee.org)

submitted 1 month ago by [email protected] to c/[email protected]

0 comments fedilink

Valtonen’s goal is to put CPUs back in their rightful, ‘central’ role. In order to do that, he and his team are proposing a new paradigm. Instead of trying to speed up computation by putting 16 identical CPU cores into, say, a laptop, a manufacturer could put 4 standard CPU cores and 64 of Flow Computing’s so-called parallel processing unit (PPU) cores into the same footprint, and achieve up to 100 times better performance. Valtonen and his collaborators laid out their case at the IEEE Hot Chips conference in August.

3

Six Terrific Books About Decision Making by Non-psychologists (smallpotatoes.paulbloom.net)

submitted 1 month ago by [email protected] to c/[email protected]

0 comments fedilink

Paul Bloom shares six terrific books about decision-making by non-psychologists. These books offer unique perspectives on psychology and insightful approaches to understanding decision-making processes.

Book list:

The Scout Mindset: Why Some People See Things Clearly and Others Don't, by Julia Galef
The Biggest Bluff: How I Learned to Pay Attention, Master Myself, and Win, by Maria Konnikova
Transformative Experience, by Laurie Paul
Wild Problems: A Guide to the Decisions that Define Us, by Russ Roberts
Trying Not to Try: The Art and Science of Spontaneity, by Edward Slingerland
Alchemy: The Dark Art and Curious Science of Creating Magic in Brands, Business, and Life, by Rory Sutherland

Check out the post for the bonus 7th book.

7

What Independent Bookshops Really Sell (www.persuasion.community)

submitted 1 month ago by [email protected] to c/[email protected]

0 comments fedilink

Highlights

America’s independent bookstores may look like the tattered, provincial shops of a bygone era—holding onto their existence by the slimmest thread. And booksellers may appear genial and absent-minded, like characters out of Dickens. But in reality, they’re the marketing geniuses of our time.

In August of last year Publishers Weekly reported, “Bookstore sales finished the first half of 2023 up 6.9% over the comparable period in 2022.” In fact, independent bookstore sales outpaced most other publishing industry metrics in 2023, growing faster than overall unit sales of print books. This is unprecedented.

Booksellers have bent the rules of the free market. For the first time in history, a significant chunk of the buying public are voluntarily paying almost double—and going out of their way—to buy exactly the same product they can get cheaper and often faster somewhere else. And it’s all due to that ABA message: “non-corporate, authentic, and socially responsible.”

What no one says is that the bargain works both ways. If book buyers must behave virtuously and tithe an additional $11 a book, then booksellers must uphold the community’s doctrines. They’re locked in the moral contract, too.

But books are different. They signal something about readers’ intelligence, identity, and closely held ideas. Books confer status—especially among the highly educated. The people who sell them know this and they used it to make their case.

“Most independent bookstores have succeeded because they’ve responded to the needs of their community,” says Jan Weissmiller, co-owner of Iowa City’s Prairie Lights since 2008. “If they’re in a part of the country where people are asking for a certain kind of book, that’s what they have on the shelf. Because they’re a business.”

But what I really want is a store where all the ideas are on display—the socialist, capitalist, monogamous, polyamorous, urban, rural, popular, and reviled—that also has the homely sacrosanct quality of one of Hemingway’s coffee-and-absinthe bars. With great music, please—and no puppets, or cheap pizza.

Where are you buying books?

3

Stumbling on Happiness - by Daniel Gilbert | Derek Sivers (sive.rs)

submitted 2 months ago by [email protected] to c/[email protected]

0 comments fedilink

Genes tend to be transmitted when they make us do things that transmit genes.

Notes on the book. Seems to be a fun one ;) Have you read it?

3

Dangerous memes (www.ted.com)

submitted 2 months ago by [email protected] to c/[email protected]

0 comments fedilink

Every time I happen to consume news, I remember this video.

What book(s) are you currently reading or listening? August 27 in c/[email protected]

[–] [email protected] 2 points 2 months ago

A Tomb for Boris Davidovich - Danilo Kiš

Is it normal for companies these days to solely rely on Amazon RDS backup without another backup strategy? in c/[email protected]

[–] [email protected] 4 points 4 months ago

no

Lessons learned from two decades of Site Reliability Engineering in c/[email protected]

[–] [email protected] 1 points 5 months ago

Reread it again today, with some highlights:

Lessons Learned from Twenty Years of Site Reliability Engineering

Metadata

Author: sre.google
Category: article
URL: https://sre.google/resources/practices-and-processes/twenty-years-of-sre-lessons-learned/

Highlights

The riskiness of a mitigation should scale with the severity of the outage

We, here in SRE, have had some interesting experiences in choosing a mitigation with more risks than the outage it's meant to resolve.

We learned the hard way that during an incident, we should monitor and evaluate the severity of the situation and choose a mitigation path whose riskiness is appropriate for that severity.

Recovery mechanisms should be fully tested before an emergency

An emergency fire evacuation in a tall city building is a terrible opportunity to use a ladder for the first time.

Testing recovery mechanisms has a fun side effect of reducing the risk of performing some of these actions. Since this messy outage, we've doubled down on testing.

We were pretty sure that it would not lead to anything bad. But pretty sure is not 100% sure.

A "Big Red Button" is a unique but highly practical safety feature: it should kick off a simple, easy-to-trigger action that reverts whatever triggered the undesirable state to (ideally) shut down whatever's happening.

Unit tests alone are not enough - integration testing is also needed

This lesson was learned during a Calendar outage in which our testing didn't follow the same path as real use, resulting in plenty of testing... that didn't help us assess how a change would perform in reality.

Teams were expecting to be able to use Google Hangouts and Google Meet to manage the incident. But when 350M users were logged out of their devices and services... relying on these Google services was, in retrospect, kind of a bad call.

It's easy to think of availability as either "fully up" or "fully down" ... but being able to offer a continuous minimum functionality with a degraded performance mode helps to offer a more consistent user experience.
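A degraded mode can be as small as a fallback path around one dependency. The sketch below is an illustration only, not something taken from the article: `recommendations_with_fallback`, `fetch_personalized`, and the `degraded` flag are hypothetical names, and a real service would likely catch narrower exception types and emit a metric when it degrades.

```python
from typing import Callable, Dict, List, Union

Response = Dict[str, Union[List[str], bool]]


def recommendations_with_fallback(
    fetch_personalized: Callable[[], List[str]],
    fallback: List[str],
) -> Response:
    """Serve full results when possible, otherwise a reduced but useful answer."""
    try:
        return {"items": fetch_personalized(), "degraded": False}
    except Exception:
        # The dependency is unavailable: serve a static list instead of an error.
        return {"items": list(fallback), "degraded": True}
```

Returning the `degraded` flag lets the caller (or a dashboard) tell "up but reduced" apart from "fully up", which is the distinction the highlight is drawing.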

This next lesson is a recommendation to ensure that your last-line-of-defense system works as expected in extreme scenarios, such as natural disasters or cyber attacks, that result in loss of productivity or service availability.

A useful activity can also be sitting your team down and working through how some of these scenarios could theoretically play out—tabletop game style. This can also be a fun opportunity to explore those terrifying "What Ifs", for example, "What if part of your network connectivity gets shut down unexpectedly?".

In such instances, you can reduce your mean time to resolution (MTTR), by automating mitigating measures done by hand. If there's a clear signal that a particular failure is occurring, then why can't that mitigation be kicked off in an automated way? Sometimes it is better to use an automated mitigation first and save the root-causing for after user impact has been avoided.
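One way to picture "kick off the mitigation in an automated way" is a lookup from a clear failure signal to a pre-approved action, with a human paged for anything unrecognized. This is a rough sketch under assumed names (`auto_mitigate`, the `playbook` dictionary, and the signal strings are all hypothetical), not a description of Google's tooling.

```python
from typing import Callable, Dict, Optional

# A mitigation is any zero-argument action that returns a short status string.
Mitigation = Callable[[], str]


def auto_mitigate(signal: str, playbook: Dict[str, Mitigation]) -> Optional[str]:
    """Run the known mitigation for a signal, or return None to escalate."""
    action = playbook.get(signal)
    if action is None:
        # No clear, pre-approved response: leave it to a human.
        return None
    return action()
```

For example, a playbook entry such as `{"bad_rollout": roll_back}` would trigger the rollback immediately, and root-causing can wait until user impact has stopped.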

Having long delays between rollouts, especially in complex, multiple component systems, makes it extremely difficult to reason out the safety of a particular change. Frequent rollouts—with the proper testing in place—lead to fewer surprises from this class of failure.

Having only one particular model of device to perform a critical function can make for simpler operations and maintenance. However, it means that if that model turns out to have a problem, that critical function is no longer being performed.

Latent bugs in critical infrastructure can lurk undetected until a seemingly innocuous event triggers them. Maintaining a diverse infrastructure, while incurring costs of its own, can mean the difference between a troublesome outage and a total one.