3 posts tagged with "openapi"

Instant pipelines with dlt-init-openapi

May 28, 2024 · 4 min read

Adrian Brudaru

Open source Data Engineer

Dear dltHub Community,

We are thrilled to announce the launch of our groundbreaking pipeline generator tool.

We call it dlt-init-openapi.

Just point it to an OpenAPI spec, select your endpoints, and you're done!

What's OpenAPI again?

OpenAPI is the world's most widely used API description standard. You may have heard of Swagger docs; those are docs generated from the spec. In 2021, an information-security company named Assetnote scanned the web and unearthed 200,000 public OpenAPI files. Modern API frameworks like FastAPI generate such specifications automatically.

How does it work?

A pipeline is a series of datapoints or decisions about how to extract and load the data, expressed as code or config. I say decisions because building a pipeline boils down to inspecting documentation or a response and deciding how to write the code.

Our tool does its best to pick out the necessary details and detect the rest to generate the complete pipeline for you.

The information required for taking those decisions comes from:

The OpenAPI Spec (endpoints, auth)
The dlt REST API Source which attempts to detect pagination
The dlt init OpenAPI generator which attempts to detect incremental logic and dependent requests.
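
To make the first of these inputs concrete, here is a minimal sketch of the kind of information an OpenAPI document exposes: the available GET endpoints and the declared auth schemes. It is not how dlt-init-openapi is implemented. The trimmed spec and the helper names `list_get_endpoints` and `list_auth_schemes` are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical, heavily trimmed OpenAPI 3 document used only for illustration.
SPEC: Dict[str, Any] = {
    "openapi": "3.0.0",
    "paths": {
        "/customers": {"get": {"operationId": "list_customers"}},
        "/customers/{id}": {
            "get": {"operationId": "get_customer"},
            "delete": {"operationId": "delete_customer"},
        },
    },
    "components": {
        "securitySchemes": {"basic": {"type": "http", "scheme": "basic"}}
    },
}


def list_get_endpoints(spec: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (path, operationId) for every GET operation in the spec."""
    endpoints = []
    for path, operations in spec.get("paths", {}).items():
        get_op = operations.get("get")
        if get_op is not None:
            # Fall back to the path when the spec gives no operationId.
            endpoints.append((path, get_op.get("operationId", path)))
    return endpoints


def list_auth_schemes(spec: Dict[str, Any]) -> Dict[str, str]:
    """Map each declared security scheme name to its type (http, apiKey, ...)."""
    schemes = spec.get("components", {}).get("securitySchemes", {})
    return {name: scheme.get("type", "unknown") for name, scheme in schemes.items()}


print(list_get_endpoints(SPEC))
print(list_auth_schemes(SPEC))
```

Pagination, incremental logic, and dependent requests are usually not spelled out this explicitly in a spec, which is why the other two components have to detect them.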

How well does it work?

This is something we are also learning about. We did an internal hackathon where we each built a few pipelines with this generator. In our experiments with APIs for which we had credentials, it worked pretty well.

However, we cannot undertake a big detour from our work to manually test each possible pipeline, so your feedback will be invaluable. So please, if you try it, let us know how well it worked - and ideally, add the spec you used to our repository.

What to do if it doesn't work?

Once a pipeline is created, it is a fully configurable instance of the REST API Source. So if anything did not go smoothly, you can make the final tweaks. You can learn how to adjust the generated pipeline by reading our REST API Source documentation.

Are we using LLMs under the hood?

No. This is a potential future enhancement, so maybe later.

The pipelines are generated algorithmically with deterministic outcomes. This way, we have more control over the quality of the decisions.

If we took an LLM-first approach, the errors would compound and put the burden back on the data person.

We are however considering using LLM-assists for the things that the algorithmic approach can't detect. Another avenue could be generating the OpenAPI spec from website docs. So we are eager to get feedback from you on what works and what needs work, enabling us to improve it.

Try it out now!

Video Walkthrough:

Colab demo - Load data from Stripe API to DuckDB using dlt and OpenAPI

Docs for dlt-init-openapi

dlt init openapi code repo.

Specs repository you can generate from.

Showcase your pipeline in the community sources here

Solving data engineering headaches in open source is a team sport. We got this far with your feedback and help (especially on the REST API source), and we are counting on your continued usage and engagement to steer us as we push what's possible in uncharted but needed directions.

So here's our call to action:

We're excited to see how you will use our new pipeline generator and we are eager for your feedback. Join our community and let us know how we can improve dlt-init-openapi
Got an OpenAPI spec? Add it to our specs repository so others may use it. If the spec doesn't work, please note that in the PR and we will use it for R&D.

Thank you for being part of our community and for building the future of ETL together!

- dltHub Team

Saving 75% of work for a Chargebee Custom Source via pipeline code generation with dlt

March 7, 2024 · 7 min read

Adrian Brudaru & Violetta Mishechkina

Data Engineer & ML Engineer

At dltHub, we have been pioneering the future of data pipeline generation, making complex processes simple and scalable. We have been building dlt not only for humans, but also for LLMs.

Pipeline generation on a simple level is already possible directly in ChatGPT chats - just ask for it. But doing it at scale, correctly, and producing comprehensive, good quality pipelines is a much more complex endeavour.

Our early exploration with code generation

As LLMs became available at the end of 2022, we were already uniquely positioned to be part of the wave. Because dlt is a library, an LLM could use it to build pipelines without the complexities of traditional ETL tools.

From the start, this raised the question: what are the different levels of pipeline quality? For example, how does a user code snippet, which formerly had value, compare to LLM snippets that can be generated en masse? What does a perfect pipeline look like now, and what can LLMs do?

We were only able to answer some of these questions, but we had some concrete outcomes that we carry into the future.

In June ‘23 we added a GPT-4 docs helper that generates snippets

Try it on our docs; it's widely used as a code troubleshooter.

We created an OpenAPI based pipeline generator

Blog: https://dlthub.com/docs/blog/open-api-spec-for-dlt-init
The OpenAPI spec describes the API. Just as we can generate Swagger docs or a Python API wrapper from it, we can generate pipelines.

Running into early limits of LLM automation: A manual last mile is needed

Ideally, we would love to point a tool at an API or doc of the API, and just have the pipeline generated.

However, the OpenAPI spec does not contain all the information needed to generate a complete pipeline. There are many challenges to overcome and gaps that need to be handled manually.

While LLM automation can handle the bulk, some customisation remains manual, and that manual work generates requirements for our ongoing efforts towards full automation.

Why revisit code generation at dlt now?

Growth drives a need for faster onboarding

The dlt community has been growing steadily in recent months. In February alone we had a 25% growth on Slack and even more in usage.

New users generate a lot of questions and some of them used our onboarding program, where we speed-run users through any obstacles, learning how to make things smoother on the dlt product side along the way.

Onboarding usually means building a pipeline POC fast

During onboarding, most companies want to understand if dlt fits their use cases. For these purposes, building a POC pipeline is pretty typical.

This is where code generation can prove invaluable: reducing the build time from 2-3 days to half a day would lower the workload for both users and our team.

💡 To join our onboarding program, fill in this form to request a call.

Case Study: How our solution engineer Violetta used our PoC to generate a production-grade Chargebee pipeline within hours

In a recent case, one of our users wanted to try dlt with a source we did not list in our public sources - Chargebee.

Since the Chargebee API uses the OpenAPI standard, we used the OpenAPI PoC dlt pipeline code generator that we built last year.

Starting resources

POC for getting started, human for the last mile.

Blog post with a video guide: https://dlthub.com/docs/blog/open-api-spec-for-dlt-init
OpenAPI Proof of Concept pipeline generator: https://github.com/dlt-hub/dlt-init-openapi
Chargebee OpenAPI spec: https://github.com/chargebee/openapi
Understanding of how to make web requests
And 4 hours of time - this was part of our new hire Violetta’s onboarding tasks at dltHub so it was her first time using dlt and the code generator.

Onward, let’s look at how our new colleague Violetta, ML Engineer, used this PoC to generate PoCs for our users.

Violetta shares her experience:

So the first thing I found extremely attractive — the code generator actually created a very simple and clean structure to begin with.

I was able to understand what was happening in each part of the code. What unfortunately differs from one API to another is the authentication method and pagination. This needed some tuning. Also, there were other minor inconveniences which I needed to handle.

There were no great challenges. The most ~~difficult~~ tedious part was probably manually changing the pagination in different sources and renaming each table.

1) Authentication

The generated authentication was a bit off. The generated code assumed a username and password, but what was actually required was an empty username plus the api_key as the password. So the super easy fix was changing

def to_http_params(self) -> CredentialsHttpParams:
    cred = f"{self.api_key}:{self.password}" if self.password else f"{self.username}"
    encoded = b64encode(f"{cred}".encode()).decode()
    return dict(cookies={}, headers={"Authorization": "Basic " + encoded}, params={})

to

def to_http_params(self) -> CredentialsHttpParams:
    encoded = b64encode(f"{self.api_key}".encode()).decode()
    return dict(cookies={}, headers={"Authorization": "Basic " + encoded}, params={})

Also, I was pleasantly surprised that the generator had several different authentication methods built in, so I could easily replace BasicAuth with BearerAuth or OAuth2, for example.
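
For reference, the standard HTTP Basic scheme base64-encodes the string "username:password", and either side of the colon may be empty. The sketch below shows that general construction; it is not taken from the generated Chargebee source. The helper name `basic_auth_header` and the placeholder key are assumptions, and which side of the colon an API key belongs on depends on the API, so check the provider's docs.

```python
from base64 import b64encode
from typing import Dict


def basic_auth_header(username: str, password: str) -> Dict[str, str]:
    """Build an HTTP Basic Authorization header from a username/password pair."""
    # The Basic scheme encodes "username:password"; either part may be empty.
    token = b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# "example_key" is a placeholder, not a real credential.
print(basic_auth_header("", "example_key"))
print(basic_auth_header("user", "pass"))
```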

2) Pagination

For the code generator it's hard to guess the pagination method from the OpenAPI specification, so the generated code has no pagination 😞. So I had to replace the line

def f():
  yield _build_response(requests.request(**kwargs))

with yielding from a 6-line get_pages function

def get_pages(kwargs: Dict[str, Any], data_json_path):
    has_more = True
    while has_more:
        response = _build_response(requests.request(**kwargs))
        yield extract_nested_data(response.parsed, data_json_path)
        kwargs["params"]["offset"] = response.parsed.get("next_offset", None)
        has_more = kwargs["params"]["offset"] is not None

The downside — I had to do it for each resource.
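
One way to avoid repeating that loop per resource is to pass the request function in as a parameter. The following is a self-contained sketch of the same offset-token pattern, runnable without network access. The `fake_fetch` stub, the `FAKE_PAGES` data, and the `paginate` helper are hypothetical; only the "next_offset" key comes from the snippet above, and the "list" key is an assumed name for where the records live.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

# Stand-in for the API: maps an offset token to a response body.
# The data is made up for illustration.
FAKE_PAGES: Dict[Optional[str], Dict[str, Any]] = {
    None: {"list": [{"id": 1}, {"id": 2}], "next_offset": "page-2"},
    "page-2": {"list": [{"id": 3}], "next_offset": None},
}


def fake_fetch(params: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical replacement for the real HTTP call."""
    return FAKE_PAGES[params.get("offset")]


def paginate(
    fetch: Callable[[Dict[str, Any]], Dict[str, Any]],
    params: Optional[Dict[str, Any]] = None,
    data_key: str = "list",
) -> Iterator[List[Dict[str, Any]]]:
    """Yield one page of records at a time until no next_offset is returned."""
    params = dict(params or {})  # copy so the caller's dict is not mutated
    while True:
        body = fetch(params)
        yield body.get(data_key, [])
        next_offset = body.get("next_offset")
        if next_offset is None:
            break
        params["offset"] = next_offset


pages = list(paginate(fake_fetch))
print(pages)
```

Each resource would then call the same `paginate` helper with its own fetch function and data key.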

3) Too many files

The code wouldn’t run because it wasn’t able to find some models. I found a commented-out line in the generator script

# self._build_models()

I regenerated the code with the line uncommented and understood why it had been commented out: the generator created 224 .py files under the models directory. It turned out I needed only two of them, the models used in the API code. So I just removed the other 222 garbage files and forgot about them.

4) Namings

The only problem I was left with was naming. The generated table names looked like ListEventsResponse200ListItem or ListInvoicesForACustomerResponse200ListItem. I had to go and change them to something more appropriate, like events and invoices.
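
Renaming like this can be partly scripted. The sketch below is one possible heuristic, assuming every generated name follows the List...Response<status>ListItem pattern of the two examples above; the `table_name` helper is hypothetical. Dropping the "For..." qualifier loses information and could make two different endpoints collide on the same table name, so the output should be reviewed by hand.

```python
import re


def table_name(model_name: str) -> str:
    """Turn a generated model name into a short snake_case table name."""
    name = re.sub(r"^List", "", model_name)           # drop the leading verb
    name = re.sub(r"Response\d+ListItem$", "", name)  # drop the response suffix
    name = re.sub(r"For[A-Z].*$", "", name)           # drop "ForACustomer"-style qualifiers
    # Insert underscores at lower-to-upper boundaries, then lowercase.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


for generated in (
    "ListEventsResponse200ListItem",
    "ListInvoicesForACustomerResponse200ListItem",
):
    print(generated, "->", table_name(generated))
```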

The result

Result: https://github.com/dlt-hub/chargebee-source

I did a walk-through with our user, and some additional context started to appear: for example, which endpoints should use the replace write disposition and which would need merge keys specified. So in the end this source would still require some testing and fine-tuning from the user.

I think the real benefit here is the head start. I don’t know how much time I would have spent on this source if I had started from scratch. For the first couple of hours I would probably have been deciding where the authentication code should go, or searching the docs for how to use dlt configs. I would certainly have needed to go through all the API endpoints in the documentation to find the ones I needed. Many of these things can be difficult, especially if you’re doing them for the first time.

If I had done it from scratch, I think I would have ended up with cleaner code but spent a couple of days. With the generator, even with fine-tuning, I spent about half a day. And the structure was already there, so it was easier to work with overall and I didn’t have to consider everything upfront.

We are currently working on making full generation a reality.

Stay tuned for more, or
Join our slack community to take part in the conversation.

dlt & openAPI code generation: A step beyond APIs and towards 10,000s of live datasets

June 21, 2023 · 3 min read

Matthaus Krzykowski

Co-Founder & CEO at dltHub

Today we are releasing a proof of concept of the dlt init extension that can generate dlt pipelines from an OpenAPI specification.

If you build APIs, for example with FastAPI, you can, thanks to the OpenAPI spec, automatically generate a python client and give it to your users. Our demo takes this a step further and enables you to generate advanced dlt pipelines that, in essence, convert your API into a live dataset.

You can see how Marcin generates such a pipeline from the OpenAPI spec using the Pokemon API in the Loom below.

Part of our vision is that each API will come with a dlt pipeline - similar to how these days often it comes with a python client. We believe that very often API users do not really want to deal with endpoints, http requests, and JSON responses. They need live, evolving datasets that they can place anywhere they want so that it's accessible to any workflow.

We believe that API builders will bundle dlt pipelines with their APIs only if such a process is hassle free. One answer to that is code generation and the reuse of information from the OpenAPI spec.

This release is a part of a bigger vision for dlt of a world centered around accessible data for modern data teams. In these new times code is becoming more disposable, but the data stays valuable. We eventually want to create an ecosystem where hundreds of thousands of pipelines will be created, shared, and deployed. Where datasets, reports, and analytics can be written and shared publicly and privately. Code generation is automation on steroids and we are going to be releasing many more features based on this principle.

Generating a pipeline for PokeAPI using OpenAPI spec

In the embedded loom you saw Marcin pull data from the dlt pipeline created from the OpenAPI spec. The proof of concept already uses a few tricks and heuristics to generate useful code. Contrary to what you may think, PokeAPI is a complex one with a lot of linked data types and endpoints!

It created a resource for all endpoints that return lists of objects.
It used heuristics to discover and extract lists wrapped in responses.
It generated dlt transformers from all endpoints that have a matching list resource (and return the same object type).
It used heuristics to find the right object id to pass to the transformer.
It allowed Marcin to select endpoints using the questionary lib in CLI.
It listed at the top the endpoints with the most central data types (think of tables that refer to several other tables).
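
As an illustration of the second point, a heuristic for finding a list wrapped inside a response object can be as simple as picking the first top-level field that holds a list of objects. This is a sketch of the general idea, not the PoC's actual code; the `find_wrapped_list` helper is hypothetical, and the sample response only mimics the shape of a paginated list endpoint.

```python
from typing import Any, List, Optional, Tuple


def find_wrapped_list(payload: Any) -> Optional[Tuple[str, List[Any]]]:
    """Return (key, items) for the first top-level list of objects, if any."""
    if isinstance(payload, list):
        return ("$", payload)  # the response is already a bare list
    if not isinstance(payload, dict):
        return None
    for key, value in payload.items():
        # A non-empty list of dicts is the most likely candidate for table rows.
        if isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            return (key, value)
    return None


# Shape modelled on a paginated list response; values are illustrative.
response = {
    "count": 2,
    "next": None,
    "results": [{"name": "bulbasaur"}, {"name": "ivysaur"}],
}
print(find_wrapped_list(response))
```

A real generator would also need to handle nested wrappers and empty pages, which this sketch deliberately ignores.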

As mentioned, the PoC was well tested with PokeAPI. We know it also works with many other APIs; we just can’t guarantee that our tricks work in all cases, as they have not been extensively tested.

Anyway: Try it out yourself!

We plan to take this even further!

We will move this feature into dlt init and integrate with LLM code generation!
Restructuring of the python client: We will fully restructure the underlying python client. We'll compress all the files in the pokemon/api folder into a single, nice, and extendable client.
GPT-4 friendly: We'll allow easy addition of pagination and other injections into the client.
More heuristics: Many more heuristics to extract resources and their dependencies, and to infer incremental and merge loading.
Tight integration with FastAPI on the code level to get even more heuristics!

Your feedback and help are greatly appreciated. Join our community, and let’s build together.
