- Other posts in series
- Introduction
- 7:00, Automating machine learning workflow with DVC, Hongjoo Lee
- 7:30, Tips for Data Cleaning, Hui Ziang Chua
- 9:00, Neural Style Transfer and GANs, Anmol Krishan Sachdeva
- 10:00, Data Visualisation Landscape, Bence Arato
- 10:30, Binder, Sarah Gibson
- 11:00, Data pipelines with Python, Robson Junior
- 12:15, Probabilistic Forecasting with DeepAR, Nicolas Kuhaupt
- 13:15, Fast and Scalable ML in Python, Victoriya Fedotova and Frank Schlimbach
- 14:15 , Small Big Data in Pandas, Dask and Vaex, Ian Ozsvald
- 14:45, IPython, Miki Tebeka
- 15:15, NLPeasy, Philipp Thomann
- 17:45, 30 Golden Rules for Deep Learning Performance, Siddha Ganju
- 19:00, Analytical Functions in SQL, Brendan Tierney
- 19:30, Collaborative data pipelines with Kedro, Tam-Sanh Nguyen
Other posts in series
Introduction
I am attending the online conference EuroPython 2020, and I thought it would be good to record what my thoughts and the things I learn from the talks.
Automating machine learning workflow with DVC, Hongjoo Lee
7:00,Notes from talk
- Works for SK hynix, a memory chip maker in South Korea.
- Waterfall vs agile production. Waterfall = design, then build, then release, then done. Agile, more iterative approach.
- Software dev/dev ops: code git, build sbt maven, test jUnit, release Jenkins, deploy Docker aws, operate Kubernetes, monitor ELK stack
- ML dev lifecycle: get data, preprocess, build model, optimise model, deployment.
- Often getting data requires domain expertise
- Often software engineers are needed for deployment
- Still improvement and modification needed for ML dev workflow. In early stages.
- Data versioning.
- Can be terrible, with names like
raw_data
,cleaned_data
,cleaned_data_final
. - Need to have system where any change in data will trigger pipeline
- Can be terrible, with names like
- ML is metric driven. Software engineering is feature driven.
- ML Models version should be tracked.
- Metrics should be versioned/tracked
- DVC helps. Some laternatives: git-LFS, MLflow, Apache Airflow
- Easy to use
- Language independent
- Useful for individuals and for large teams
- Example code: on github
- Hongjoo works through a cats vs dogs example in talk
- Download template directory
- Use git to keep track of changes
- Use dvc command-line commands to define each of the steps in ml pipeline
- dvc dag command - shows ascii diagram of pipeline
- dvc repro - checks if there are changes, and if so, re-runs pipeline. E.g. change model by adding a layer, then dvc repro does everything. you have to manually to git commit command and naming of versions
- dvc metric show. shows metrics of all versions done.
My thoughts
Looks simple enough! I could follow the talk. :D It is clear that finding a good ML workflow is a big theme. At the end of the conference, I will have to go through these notes and collate the various tools and workflows people use, so I have a reference for when I need to use it.
Tips for Data Cleaning, Hui Ziang Chua
7:30,Notes from talk
- Background. Singaporean. Works at essence. Blog data double confirm.
- Will try to give business centred context - very different from academic/research/learning contetx
- Tasks
- Get column names
- Get size of dataset. Make sure all data has been loaded.
- Check datatypes. Sometimes goes wrong. Sign for data-cleaning.
- Get unique values. Some cases, need to combine different values into one. E.g. ‘Male’ and ‘M’
- Get range of values
- Get count of values. Group-by various columns as appropriate.
- Rename column names. E.g. for merging
- Remove symbols in values. E.g. currency signs
- Convert strings to numeric or to dates
- Replace values with more sensible values. E.g. ‘Male’ vs ‘M’
- Identify variables/columns similar or different across datasets
- Concatenate data. Data from different quarters added to single table.
- Deduplication. Remove duplicate data.
- Merge
- Recoding. Feature engineering
- Data profiling (optional)
- Input missing values (optional)
- Common issues
- inconsistent naming of variables
- bad data formats
- invalid,missing values
- Resources: tinyurl.com/y5b3y7to
My thoughts
It is useful to have a checklist of tasks one should do when they have to clean data. Interesting that they considered the input of missing values as a bonus task - the impression I got from the Kaggle tutorials is that one ought to do some imputation.
Neural Style Transfer and GANs, Anmol Krishan Sachdeva
9:00,Notes from talk
- Recap of GANs
- Discriminative model. Supervised classification model. Fed in data.
- Generative model. Mostly unsupervised. Generates new data by underlying underlying data distribution. Generates near-real looking data. (Conditional generators have some element of supervised learning included). Learning of distribution called implicit density estimation.
- In end, have GAN which takes random input and creates an output that has similar distribution it was trained on.
- Training. Dicriminator gets true entry x and fake input from generator network x*. Generator tries to classify. Compute error and backpropogate. Generator then trained. Similar.
- Book: GANs in Action by Langr and Bok
- GANs have been successful in creating near-real images.
- Style Transfer: Content image + style image gives new image with content from first in style of second image.
- Aim of Style Transfer Networks
- Not to learn underlying distrbution.
- Somehow learn style from style image and embed it into content cimage
- Interpolation is bad
- Used in gaming industry, mobile applications, fashion/design
- Popular networks: Pix2Pix, CycleGAN, Neural Style Transfer
- Neural Style Transfer.
- No training set! Just take as input the two images.
- Example from Neural Algorithm of Artistic Style. Photo turned into painting of certain artist
- Content loss - measure of content between new image and content image
- Style loss - measure of style between new image and style image.
- total variation loss. Check for blurriness, distortions, pixelations
- How is learning done? No trianing set, no back prop?
- Shows example notebook, using TensorFlow and Keras. Imports pre-trained model.
- Content loss.
- Pixel by pixel comparison not done.
- Compare higher level features, obtained from pre-trained model. E.g. VGG19 is 19 layered CNN, which classifies images into 1000 categories. Trained on million images.
- VGG architecture. See Generative Deep Learning by David Foster.
- Keras repo. Neural style transfer.
- Take 3rd layer of block 5 from VGG19 model as measure of higher level features.
- Content loss = mean squared error between this vector encoding of high level features from VGG19 for content image and generated image.
- Style loss
- Take dot product of ‘flattened feature maps’
- Lower level layers represent style.
- GRAM matrix. Dot product of ‘flattened features’ of image with itself.
- Didn’t understand this.
- Loss = MSE (gram matrix(style image), gram matrix(generated image))
- Then take weighted sum over different sets of layers
- Shows example code.
features = K.batch_flatten(K.permute...)
- Take first layer in each block
- Total variation loss
- sum of squared difference between image and image shifted on pixel down, and of shifted one pixel right
- Training
- Add up all the loss
- Use L-BFGS optimisation. Essentially just gradient descent on individual pixel values.
- Example code shown in talk.
- Pix2Pix
- Various cool things you can do. E.g. aerial image to a map. Sketch to full image.
- CycleGAN
- E.g. convert image of apples to oranges. Or horses to zebras!
My thoughts
Excellent talk! I had seen some of the neural style transfer images before, and now I have some understanding of how they are created!
Data Visualisation Landscape, Bence Arato
10:00,Notes from talk
- There are lots of libraries out there.
- Imperative vs declarative. Imperative: specify how something should be done. Declarative: specify what should be done.
- Matplotlib. Biggest example. Has gallery + code examples
- Background in MATLAB
- Challenge: imperative
- Seaborn. Aim to provide higher level package on top of matplotlib. Good defaults for charts that look good.
- Example scatterplot much easier in seaborn than in matplotlib
- plotnine. Aim to higher-level. Based on ggplot2 in R.
- Syntax is basically same as ggplot syntax!
-
aes
,geom_point
etc.
- bokeh. 2nd most widely known tool.
- Big thing is interactivity, like sliders, checkboxes, etc.
- Example scatterplot. Quite long and low-level
- Based on web/javascript background
- HoloViews
- Higher level language for bokeh, matplotlib, plotly
- Just have to change one line code to switch between bokeh or matplotlib output
- Scatterplot example. Short code.
- hvPlot. built on top of holoviews
- pandas bokeh. add plot_bokeh() method in pandas.
- Chartify. From spotify. Built on top of bokeh.
- Plotly.
- Might be only one to do 3d charts well
- Low level charts
- Plotly express
- Higher level version of plotly.
- Vega, vega-lite
- Visualisation ‘grammar’.
- JSON based way of describing charts
- Altair
- High level version of vega
- Dashboards.
- Plotly Dash
- Panel. Anaconda related. Built on top of four big charting libaries above
- Voila. Looks cool - can visualise ML things. Need to check out!
- Streamlit. Datascience specific tools. Example of GAN Image generator with sliders. Change slider will re-run model and create new image.
- PyViz. Thorough list of all visualisation libraries.
- Python Data Visualisation, at AnacondaCON 2020
My thoughts
Excellent talk. Well structured, good examples, good summary of key things I should know about.
Binder, Sarah Gibson
10:30,Notes from talk
- Table of levels of reproducibility
- Same or different data or analysis.
- Rep = same, same
- Replicable = different data
- Robust = different analysis
- Generalisable = diff, diff.
- Repeatable is subset of reproducible. Literally same programs running same data and analysis to get same result.
- Reproducible means getting same result using same data and method - but maybe implemented in different language or program or …
- CI/CD = continous integration / development
- Two big tools needed for repeatable research: dockers and version control. But requires learning. Not for everyone.
- Shows example from ligo about gravitational waves.
- Steps
- Use jupyter notebook
- Upload on public repositiy, e.g. GitHub.
- Describe software needed to run notebook. Binder automatically identifies common configuration files
- Done!
- Brief history of Binder. Now 140,000 sessions per week!
- Binder is open source, can adapt to your own needs. E.g. share only with specific people in your institution.
- Technologies
- Github, clone reposity
- repo2docker. Build docker image based on standard configuration files. Don’t need docker file!
- Docker. Execute docker image
- Jupyter Hub. Allocate resources, make image accessible at url
- Binder, redirect user to the url
- Scaling up
- Created federation
- Highly stable. Uses different kubernetes implementations
- User surveys
- Around 80% of respondants would recommend service
- Most common use case is teaching related. Examples, workings, uni teaching, demos, etc.
- Biggest complaint: needs to be faster to load
- Hard to speed up. But fully explained on jupyter hub blog. Why it is slow but also tips to speed things up.
My thoughts
I had already heard of Binder - because I am friends with the spaker Sarah Gibson! However, the talk was still good, and I learnt more than I already knew. In particular, I liked the classification of different levels of reproducibility.
Data pipelines with Python, Robson Junior
11:00,Notes from talk
- Agenda: Not about code. Anatomy of data product, different architectures, qualities of data pipelines, how python matters.
- Anatomy of data product.
- Ingress. Logs, databases, etc. Volume and variety are important.
- Processes. Processing both input and output data. Veracity and velocity are important here. E.g bank processing payment may have to run fraud detection - has to be very quick!
- Egress. Apis or databases. Veracity is important.
- Lambda and Kappa Architeture
- Above anatomy of data product is same as computer program: input is files and memory, processes in ram, output to screen.
- Lambda. Input data. Processing split into two layers. Speed layer for stream data, real time view. Batch layer, all data, batch views, usually processed periodically. Then output via query. See talk for diagram. Some pros and cons given in talk
- Kappa. Only have a speed layer - no batch layer. Pros and cons given in talk.
- Qualities of a pipeline
- Should be secure. Who has access to which data levels, use common format for data storage, be conscious of who has access to which parts of data and why.
- Should be automated. Use good tools. Versioning, ci/cd, code review.
- Monitoring. ??
- Testable and tracable. Test all parts of the pipeline. Try to containerise tools.
- Python tools
- ELT. Apache Spark, dash, luigi, mrjob, ray
- Streaming. faust based on Kafka streams, streamparse via Apache Storm
- Analysis. Pandas, Blaze, Open Mining, Orange, Optimus
- Mangaement and scheduling. Apache Airflow. Programmatically create workflow
- Testing. pytest, mimesis (create fake data), fake2db, spark-test-base
- Validation. Cerberus, schema, voluptuous
My thoughts
Unfortunately, I found the talk/speaker hard to follow. But I still got a list of tools that I can use as a reference.
Probabilistic Forecasting with DeepAR, Nicolas Kuhaupt
12:15,Notes from talk
- Freelance data scientist. German.
- DeepAR published by Amazon Research
- It is probabilistic, like ARIMA and regression models, but not like Plain LSTMs
- Automatic Feature Engineering. Key point of neural networks! Like Plain LSTMs but unlike ARIMA and regression models.
- One algorithm for multiple timeseries. Seems unlike other algorithms. Like meta learning? Transfer learning??
- Disadvantages: time and resource intensive to train, difficult to set hyperparameters (like most neural networks).
- How it works: inputs time series x. At time t, outputs
z_t
, which are parameters for a probability distribution to predict. Thenz_t
andx_t+1
are inputted into next stage. - Example datasets. Sales at amazons - one time series for each product, sales of magazines in different stores, forecast loads of servers in datacenters, car traffic - each lane has its own time series, energy consumption in households - each household has its own time series.
- Integrated into Sagemaker.
- Provides notebook
- Lots of different tools for different stages of data science workflow. e.g. Ground Truth to use mechanical turk to build data sets. Studio is IDE. Autopilot for training models, Neo for deployment, etc.
- boto3 - access services in aws. s3fs - file storage stuff. various other details in talk
- data inform of json lines, (not pandas)
- start - start time, target - the time series, cat - some categories that timeseries belongs to, dynamic_feat - extra time series (of same length as target). note that different json lines can have time series of different lengths
- hyperparameters. standard stuff, with few extras for deepar.
- code to set up model and train it
My thoughts
Looks like a powerful algorithm. Probabilistic algorithms are definitely the way to go - how else can you manage and predict risk?
Fast and Scalable ML in Python, Victoriya Fedotova and Frank Schlimbach
13:15,Notes from talk
- Python is useful but slow. So often companies hire engineers to translate python to faster languages like C++.
- Intel made a python distribution. No code changes required. * Just-in-time computation. JIT gives big speed boost.
- ML workflow: input and preprocessing via pandas, spark, SDC. model creation and prediction: scikit learn, spark, dl frameworks, daal4py.
- In this talk, talk about SDC, sckit learn, daal4py
- Intel Daal. Data analytics acceleration library. Linear Algebra already sped up (e.g. with MKL from intel), but this new library helps in situations whcih do not use linear algebra - e.g. tree based algorithms.
- Talk gives details on how to install packages and use it.
- Many algorithms have equivalent output to scikit learn algorithms. But not all - e.g. randomforest does not have 100% same output.
- Example of k-means in scikit-learn versus daal4py. Similar structure. Slightly different syntax. e.g.
n_clusters
versusnClusters
. - Scalable dataframe compiler SDC. a just in time compiler. Extension of Numba - made by anaconda.
- Easy to use. Just add decorator
@numba.jit
to function that you want to be compiled. - Explanation of compiler pipeline.
- Talks through basic example of reading file, storing in frame, calculating mean, ordering a column. E.g. reading of file is parallelised, whereas pandas just reads data in single line.
- SDC requires code to be statically compilable - i.e. type stable. Examples where this wouldn’t work.
- Charts showing speed-ups of different operations, as you increase the number of threads. Some operations get good speeds up, and some get mega speeds ups. Most was
apply(lambda x:x)
with 400x speed up. Something to do with lambda function being compiled too, not just apply function.
My thoughts
Looks easy to use and can give big speed boosts. Is there a catch?
Small Big Data in Pandas, Dask and Vaex, Ian Ozsvald
14:15 ,Notes from talk
- Discuss how to speed things up in pandas. Why when there are tools out there? Ought to increase our knowledge and push our speed in current tools.
- Example of company registration data in uk
- Ram considerations
- Strings are slow. Takes up lots of ram.
- In example of company category taking up 300MB. Convert to category type and it takes 4.5 MB. Numeric code stored instead of strings.
- Get speed up on value_counts. 0.485s vs 0.028s.
- Example of settign this column to index and then creating mask based on index. 281ms vs 569 microseconds.
- Try using category for low cardinality data.
- float64 is default and expansive.
- Example of age of company, up to 190 years.
- Use float32 or float16 instead.
- Less RAM. Small time saving. (float16 might actually be slower!)
- Might have loss in precision in data. Depends on data and usage
- Has a tool
dtype_diet
to automate process of optimising dataframe.- Produces table showing how different things can improve RAM usage.
- Saving RAM is good. Can process more data. Speeds things up.
- Dropping to NumPy.
-
df.sum()
versusdf.values.sum()
. - In example, 19.1ms to 2ms.
- Somethings to watch out for, e.g. NaN.
- James Powell produced diagram showing all files and functions called when doing sum in pandas versus in numpy. (Doing this using
ser.sum()
).
-
- Is pandas just super slow.
- Can get big boost just by using bottleneck. see code in example.
- just instal bottleneck, numexpr
- Investigate dtype_diet
- ipython_memory_usage
- Pure python is slow
- Use numba, njit wrapper. See Intel talk above for newer extensions to numba.
- Parallelise with Dask. Overhead may overwhelm benefit. USe profiling and timing tools to check.
- Big time savings come from our own habits
- Reduce mistakes. Try nullable Int64, boolean
- Write tests, unit and end-to-end
- Lots of other examples from blog
- Vaex and Modin. Two other tools.
My thoughts
Excellent talk! Ian clearly knows his stuff. Lots of insights. These are things I should start to implement to get some easy time savings.
IPython, Miki Tebeka
14:45,Notes from talk
- Programmer for 30 years.
- Wrote book, Python Brain Teasers
- Likes ipython. It is a REPL: Real, Eval, Prompt Loop
- Magic commands, via
%
. E.g.%pwd
- Can refer to outputs like variables.
logs_dir = Out[4]
- Command line.
!ls $logs_dir
- Auto-complete features
- He uses Vim! Woo!
- pprint for pretty printing.
-
?
for help.??
for source code -
%timeit function
to give time analysis -
%%timeit
` for multiline stuff - Can do sql stuff. have to install extension. see video for example.
-
%cow IPython Rocks!
Produces ascii art of cow!
My thoughts
Always good to see a live demo to see exactly how somebody does things. I learnt some neat little features. Also, cool to see someone using Vim!
NLPeasy, Philipp Thomann
15:15,Notes from talk
- Co-creator of liquidSVM, Nabu, NLPeasy, PlotVR
- Works at D One. ML consultancy
- NLP. Big progress recently. Word2Vec, Deep learning and many good pre-trained models. Lots of data, many use cases
- Challenges for data scientists.
- NLP is generally harder - high dimensions, specialised pre-processing required, nlp experts/researchers focus only on text but there is usually other data in business.
- Methods have repuation of being hard to use
- standard tools not good for text. e.g. seaborn, tableau
- NLPeasy to the rescue!
- Pandas based pipeline, built in Regex based tagging, spaCy based NLP methods, Vader, scraping with beautiful soup, …
- ElasticSearch. ??
- See Github repo for code, ntoebook example, etc.
- Talk through example, of looking at abstracts from some conference.
- Live demo!
- Scraping EP 2020 data.
- Doing NLPeasy stuff
- Go to Elastic dashboard
- Lots of things shown, possible. E.g. entity extraction, tSNE
- REstaurant similarity using clustering algorithms. based only on reviews! can detect similarities
- Kibana (I think) can produce geoview / heatmap
- Can create networks using entity recognition
- Can try examples in different setups, e.g. Binder, or do it all yourself.
My thoughts
Not much to say. Another tool that I now know about.
30 Golden Rules for Deep Learning Performance, Siddha Ganju
17:45,Notes from talk
- Forbes 30 under 30!
- Recommended book
- 95% of all AI Training is Transfer Learning
- Playing melodica much easier if you already know how to play the piano
- In Neural Network, earlier layers contains generic knowledge, and later layers contain task specific knowledge - at least in CNNs
- So remove last ‘classifier layers’, and classify on new task, keeping first layers as is.
- See github PracticalDL for examples and runnable scripts* Optimising hardware use.
- In standard process, CPU and GPU switch between being idle or active.
- Use a profiler. E.g. TensorFlow Profiler + TensorBoard
- Example code shown to use profiler.
- Simpler: nvidia-smi.
-
We have thing we want to optimise and metric. So how can we optimise. Here comes the 30 rules.
- DATA PROCESSING
- Use TFRecords
- Anti-pattern: thousands of tiny files/gigantic file.
- Better to have handful for large files.
- Sweet spot: 100MB TFRecord files
- Reduce size of input data
- Bad: read image, rezie, train. Then iterate
- Good: read all images, resizes, save as TFRecord. Then read and train - iterating as appropriate.
- Use TensorFlow Datasets
import tensorflow_datasets as tfds
- If you have new datasets, publish your data on TensorFlow datasets, so people can build on your research.
- Use tf.data pipeline
- Example code given
- Prefetch data
- Somehow breaks circular dependency, where CPU has to wait for GPU and vice versa.
- Does asynchronous stuff
- Code given in talk
- Parallelize CPU processing.
- same as number of cpu cores
- Paralleize input and output. interleaving.
- Non-deterministic ordering. If randomising, forget ordering.
- Somehow, not reading files ‘in order’ avoids potential bottlenecks.
- Cache data
- Avoid repeated reading from disk after first epoch
- Avoid repeatedly resizing
- Turn on experimental optimisations
- Autotune parameter values. code given in talk. Seems to refer to hardware based parameters
-
Slide showing it all combined.
- DATA AUGMENTATION
- Use GPU for augmentation, with tf.image
- Still work in progress. Limited functionality. Only rotate by 90 degrees as of now.
-
Use GPU with NVIDIA DALI
- TRAINING
- Use automatic mixed precision.
- Using 8-bit or 16-bit encodings rather than 32 or 64-bit.
- Caveat - fp.16 can cause drop in accuracy, and even loss of convergence. E.g. if gradient is tiny, fp.16 will treat it as zero.
- auto mixed precision somehow deals with this issue
- Use larger batch size.
- Larger batch size leads to smaller time per epoch (but less steps per epoch too, no?)
- Larger batch size leads to greater GPU utilization
- Use batch szies that are multiples of eight. Big jump in performance from 4095 to 4096!
- Video describes other restrictions/options
- Finding optimal learning rate
- Use keras_lr_finder
- ‘point of greatest decrease in loss’ corresponds to best learning rate, somehow. Leslie N Smith paper
- Use tf.function
-
@tf.function
.
-
- Overtrain, then generalise. STart from from dataset and increase.
- Install optimised stack
- Optimise number of parallel threads
- Use better hardware.
- Distribute training. MirroredStrat vs Multiworker…
-
Look at industiral benchmarks.
- INFERENCE
- Use an efficient model. Fewer weights.
- Quantize. 16 to 8bit.
- Prune model, remove weights close to zero
- Used fused operations
- Enable GPU persistence
My thoughts
Very handy list of tips and tricks. Sevreal of them go beyond my understanding, but does not mean I can not benefit from using them!
Analytical Functions in SQL, Brendan Tierney
19:00,Notes from talk
- TALK CANCELLED
My thoughts
Not applicable
Collaborative data pipelines with Kedro, Tam-Sanh Nguyen
19:30,Notes from talk
- Apparently 40% of Vietnamese have surname Nguyen.
- Data engineering is relatively new discipline, so there aren’t established practices.
- QuantumBlack addressed this issue with Kedro.
- Pipelines.
- Kedro viz used to visualise messy data pipeline.
- But how does it get there?
- Starts off simple. Iris data -> cleaning function, cleaned data, analyze function, analyzed data.
- But then splitting data to test/train, gets messy.
- Then have multiple data sources, each which needs to be split, and then all jumbled up
- Without any tools, hard to grow larger and more complex than this.
- But Kedro can allow you to deal with more complex pipelines.
- One source of tension is difference between data engineer and data science
- Data science usually have strong engineering skills. More data modelling skills
- Data science has more research bent / experimental bent / want to be close to data and have many iterations
- Data engineers more like engineers. Focus is on making things tidy, rather than experimentation.
- Two other challenges
- Being ready for production quickly, for business use
- Is it easy to pass the pipeline to future users - which may even be the future you!
- QuantumBlack startup from London, famous for doing work on F1. Got bought by McKinsey
-
Kedro made by QuantumBlack. Big focus on standardisation, and making it as easy as possible for long-term use.
- How does Kedro work
- Analogy - audio world has standardised:
- Input and output tools. Microphones, speakers, etc.
- Functional transformers. Filters, etc.
- Redirecting components. Make it easy for output from one tool easy to input into others.
- Standard organisational conventions. Mic, audio mixer, computer.
- Standard is to use jupyter notebook. Some conventions (e.g. inputs and outputs via
pd.read_csv
andpd.to_csv
), but mostly hard to follow. E.g. many many parameters is hard-coded, e.g. names of files being saved, parameters throughout processing, etc. - Live example:
- install kedro
- kedro new. follow steps, e.g. naming things, etc.
- created default template
- kedro viz to visualise pipeline
- Run out of time, so rushes through lots of kedro features
- Has YouTube series DataEngineerOne
My thoughts
- Looks like an intuitive system. Looks simpler than other pipelines presented in the conference. But is it because it actually is simpler, or is it because I am just getting used to pipelines. (Before this conference, I hadn’t studied pipelines at all).